• I need to organize some data

    From Luc@21:1/5 to All on Tue Dec 19 18:09:49 2023
    Solving problems and even reinventing wheels can be fun, I've done that a
    lot, but sometimes it becomes tiresome and may even become stupid
    pig-headed. So I decided to ask for help again.

    I need to collect the names of all files and directories in a path. We have glob for that.

    OK. Now I need to organize that data. The problem is, there may be many
    ways of organizing data. Talk about skinning cats.

    Maybe the user wants it all sorted by file name, alphabetically. OK. The problem is, maybe the user wants directories listed first. Maybe the user
    wants files listed first. That alone means three possible lists. Not to
    mention reverse order. That's another three lists.

    Then maybe the user wants it all sorted by date. OK. The problem is, maybe
    the user wants newest first, maybe the user wants oldest first. That alone means two possible lists. And again, maybe the user wants directories
    listed first. Maybe the user wants files listed first. Maybe the user wants files and directories mixed. Those combined make six possible lists. And an additional problem. I will certainly use the Unix epoch to sort dates, but
    will display a nicely formatted date string to the user later, so whatever
    is used in the sorting process cannot be used in the display step.

    More or less the same happens with size. I have to run calculations on raw
    byte sizes before I format them as KB, MB, GB etc. After all those
    permutations I mentioned before.

    I've done all that. It works. The problem is, 1. It's 150 lines of code.
    More IFs than ChatGPT. It's big and awkward. 2. I did it in a way that
    sorts lists with strides and assumes that filename is [lindex list 0], size
    is [lindex list 1], and so on. That is going to break because I want to let
    the user change the position of the columns. Maybe the user wants filename-size-date-permission-owner or maybe just filename-date-size or
    maybe filename-owner-date. Who knows? So I need to be able to query the
    data in random orders, according to the user's choice of column
    presence and order.

    So I thought maybe I can use a dict:
    dict create biglist $filelist
    and $filelist would be yet another dict arrangement or maybe an array with
    each file and its corresponding attributes.

    Then I would just have to parse that dict, formatting size and date on the
    fly before displaying them.

    The problem is, it's still a lot of IFs.

    Another problem is, there is a very, very unexpected little obstacle in the way. You would never guess what it is.

    I use an old kernel. And there is a bug in BTRFS. One of my files is listed twice. Every time. Same name, same size, same date, same inode. I talked
    about it in the #btrfs IRC channel and they confirmed the bug. It's old,
    but it exists in old kernels. If it's there, it has to be seen. I can't let
    my application hide it. It's not my job to sweep file system bugs under the carpet. Which means that, counterintuitively, I must not treat filenames as unique. I must assume there may be duplicates, as weird as that sounds. And they must be shown as such. You know what, I think I am going to write code
    to flag that kind of situation very visibly to the user because that's the
    kind of meticulous sad soul that I am. So anyway, I can't index my dict or array structure around file names.

    That is going to make the problem more difficult. In fact, I'm not sure
    where to start.

    Would you like to contribute any ideas? No need for code. Just broad ideas
    on how to organize the data efficiently. With duplicate file names.

    TIA

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Tue Dec 19 22:49:37 2023
    Luc <[email protected]d> wrote:
    Solving problems and even reinventing wheels can be fun, I've done that a lot, but sometimes it becomes tiresome and may even become stupid
    pig-headed. So I decided to ask for help again.

    I need to collect the names of all files and directories in a path. We have glob for that.

    OK. Now I need to organize that data. The problem is, there may be many
    ways of organizing data. Talk about skinning cats.

    Maybe the user wants it all sorted by file name, alphabetically. OK. The problem is, maybe the user wants directories listed first. Maybe the user wants files listed first. That alone means three possible lists. Not to mention reverse order. That's another three lists.

    Why would that mean six lists?

    Create one list of lists, with each sublist being the columns of raw
    data for the files that you want to use.

    Then, create a set of procs that utilize the "-command" interface to
    lsort. One proc sorts by name, one proc by date, one proc by whatever
    else.

    Then when the user changes the sort order, just resort the single
    nested list using the appropriate "sort type" proc.

    Then maybe the user wants it all sorted by date. OK. The problem is, maybe the user wants newest first, maybe the user wants oldest first. That alone means two possible lists. And again, maybe the user wants directories
    listed first. Maybe the user wants files listed first. Maybe the user wants files and directories mixed. Those combined make six possible lists. And an additional problem. I will certainly use the Unix epoch to sort dates, but will display a nicely formatted date string to the user later, so whatever
    is used in the sorting process cannot be used in the display step.

    Internal storage format and external viewing format are two different
    things, and do not (and often are not) the same format. The view is
    derived from the internal data by formatting it as it needs to be
    formatted for the view.

    I've done all that. It works. The problem is, 1. It's 150 lines of code.
    More IFs than ChatGPT. It's big and awkward. 2. I did it in a way that
    sorts lists with strides and assumes that filename is [lindex list 0], size is [lindex list 1], and so on. That is going to break because I want to let the user change the position of the columns.

    lsort has the -index option just for sorting a list of lists.

    And again, the internal storage format is not the same as display format.

    Internal storage format is the same, always. That keeps the code
    simpler.

    If the user swaps around columns then you display the columns in a
    different order to create the view, but the column order in the internal
    format does not change.

    Another problem is, there is a very, very unexpected little obstacle in the way. You would never guess what it is.

    I use an old kernel. And there is a bug in BTRFS. One of my files is listed twice. Every time. Same name, same size, same date, same inode. I talked about it in the #btrfs IRC channel and they confirmed the bug. It's old,
    but it exists in old kernels. If it's there, it has to be seen. I can't let my application hide it. It's not my job to sweep file system bugs under the carpet. Which means that, counterintuitively, I must not treat filenames as unique. I must assume there may be duplicates, as weird as that sounds. And they must be shown as such. You know what, I think I am going to write code to flag that kind of situation very visibly to the user because that's the kind of meticulous sad soul that I am. So anyway, I can't index my dict or array structure around file names.

    Nope, so just don't. Use a nested list of lists, which will store
    duplicates with no problem.

    Note that if you store the "filename" separated from the "path to the
    file" then you have to be able to support duplicate filenames anyway,
    even without a "bug", because two sub directories could have the
    identical filename inside.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to Rich on Tue Dec 19 22:52:52 2023
    On Tue, 19 Dec 2023 22:49:37 -0000 (UTC), Rich wrote:

    Note that if you store the "filename" separated from the "path to the
    file" then you have to be able to support duplicate filenames anyway,
    even without a "bug", because two sub directories could have the
    identical filename inside.


    I don't really understand what you mean here because in my case there
    won't be any subdirectories. Each "view" contemplates the flat listing
    inside a directory. There is no recursive action here whatsoever.

    There will be recursive actions when I get to the stage of copying,
    moving and deleting, but I'm not dealing with any sharp edges yet.

    Anyway, let's see the new problem I have now:



    Create one list of lists, with each sublist being the columns of raw
    data for the files that you want to use.

    Then, create a set of procs that utilize the "-command" interface to
    lsort. One proc sorts by name, one proc by date, one proc by whatever
    else.

    Then when the user changes the sort order, just resort the single
    nested list using the appropriate "sort type" proc.


    I never used the -command option before so I had to look it up.

    Ugh! Why are so many examples in the manual so opaque?


    % proc compare {a b} {
    set a0 [lindex $a 0]
    set b0 [lindex $b 0]
    if {$a0 < $b0} {
    return -1
    } elseif {$a0 > $b0} {
    return 1
    }
    return [string compare [lindex $a 1] [lindex $b 1]]
    }

    % lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}
    {1 dingo} {2 banana} {0x2 carrot} {3 apple}


    I copied and changed and ran that many times trying to understand what in
    the world is going on here, but I can't. How does the comparison work?

    Is {3 apple} compared to {0x2 carrot} then {1 dingo} compared to
    {2 banana}?

    Does it stride in leaps of 2 because two arguments were given (a and b)?

    But that doesn't make sense. I can't find a hypothesis that makes sense.
    I can't see the logic.

    I guess {3 apple} is compared with {0x2 carrot}. Actually, just their
    0 indices. So 3 > 1 so return 1. OK, 1. But what is "1" here?


    Holy mackerew. I need to dissect this thing:

    proc compare {a b} {
    puts "begin: a is $a b is $b"
    set a0 [lindex $a 0]
    set b0 [lindex $b 0]
    if {$a0 < $b0} {
    puts "R1 $a0 < $b0, returning -1"
    return -1
    } elseif {$a0 > $b0} {
    puts "R2 $a0 > $b0, returning 1"
    return 1
    }
    puts "two tests failed with $a0 vs. $b0, going to the third"
    puts "R3 [lindex $a 1] [lindex $b 1], returning [string compare [lindex $a 1] [lindex $b 1]] for [lindex $a 1] and [lindex $b 1]"
    return [string compare [lindex $a 1] [lindex $b 1]]
    }
    puts [lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}]


    Output:

    begin: a is 3 apple b is 0x2 carrot
    R2 3 > 0x2, returning 1
    begin: a is 1 dingo b is 2 banana
    R1 1 < 2, returning -1
    begin: a is 0x2 carrot b is 1 dingo
    R2 0x2 > 1, returning 1
    begin: a is 0x2 carrot b is 2 banana
    two tests failed with 0x2 vs. 2, going to the third
    R3 carrot banana, returning 1 for carrot and banana
    {1 dingo} {2 banana} {0x2 carrot} {3 apple}

    So I see the first pair was compared to the second pair, then the third
    pair was compared to the fourth pair... OK...

    Then the second pair is compared to the fourth pair... Why?

    And the results of all the comparisons were 1, -1, 1 and 1. OK.
    And what does that mean?

    Sorry, but I really can't understand the shuffle.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Wed Dec 20 02:41:42 2023
    Luc <[email protected]d> wrote:
    On Tue, 19 Dec 2023 22:49:37 -0000 (UTC), Rich wrote:

    Note that if you store the "filename" separated from the "path to the
    file" then you have to be able to support duplicate filenames anyway,
    even without a "bug", because two sub directories could have the
    identical filename inside.


    I don't really understand what you mean here because in my case there
    won't be any subdirectories. Each "view" contemplates the flat listing
    inside a directory. There is no recursive action here whatsoever.

    I had no idea whether you were building a recursive view or not. But
    if you /were/ building one, and splitting the name away from the rest
    of the path, then with recursion you can (more like 'will') have
    identical "names" (just on different paths).

    Create one list of lists, with each sublist being the columns of raw
    data for the files that you want to use.

    Then, create a set of procs that utilize the "-command" interface to
    lsort. One proc sorts by name, one proc by date, one proc by whatever >>else.

    Then when the user changes the sort order, just resort the single
    nested list using the appropriate "sort type" proc.


    I never used the -command option before so I had to look it up.

    Ugh! Why are so many examples in the manual so opaque?


    % proc compare {a b} {
    set a0 [lindex $a 0]
    set b0 [lindex $b 0]
    if {$a0 < $b0} {
    return -1
    } elseif {$a0 > $b0} {
    return 1
    }
    return [string compare [lindex $a 1] [lindex $b 1]]
    }

    % lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}
    {1 dingo} {2 banana} {0x2 carrot} {3 apple}

    There's nothing opaque above.

    I copied and changed and ran that many times trying to understand what in
    the world is going on here, but I can't. How does the comparison work?

    Note, I don't know what exact algorithm lsort uses, but in a general
    sense, what lsort does is it traverses the list, comparing pairs of
    list elements as it goes, and decides whether to swap them or not,
    based upon the return value of the '-command' proc. This may not match reality, but you should get the idea.

    In the example, the 'list' is more easily viewed if formatted this way
    (with list index values on the left):

    {
    0: {3 apple}
    1: {0x2 carrot}
    2: {1 dingo}
    3: {2 banana}
    }

    Is {3 apple} compared to {0x2 carrot} then {1 dingo} compared to
    {2 banana}?

    No, 2 is compared to 0x2. The list is four elements, each a string
    that is also a valid list. As lsort's algorithm chooses pairs to
    compare, it calls -command, and appending two list elements as
    parameters to the command.

    Inside the compare proc:

    % proc compare {a b} {
    set a0 [lindex $a 0]
    set b0 [lindex $b 0]
    if {$a0 < $b0} {
    return -1
    } elseif {$a0 > $b0} {
    return 1
    }
    return [string compare [lindex $a 1] [lindex $b 1]]
    }

    We have a parameter variable a and parameter variable b. The first
    thing compare does is extract the zero'th element of a and of b (so the
    3 in "3 apple" or the 1 in {1 dingo}. Then it checks if a0 is smaller
    than b0, if yes, it returns -1 (if you read the lsort man page you'll
    see that the expected return values from the -command is negative if a
    is smaller than b, zero of a equals b, and positive if a is larger than
    b. So the first if clause checks for a smaller than b and returns -1
    if so. Then it checks for a larger than b, returning 1 if that is the
    case. All that is left is a equal to b, and for that case it checks
    the second element of a and b (the word) as a tie breaker and returns
    -1, 0, or 1 again, this time using string compare.

    Does it stride in leaps of 2 because two arguments were given (a and b)?

    lsort calls compare with two elements, which two depend upon what exact
    sort algorithm it is that lsort uses.

    I guess {3 apple} is compared with {0x2 carrot}. Actually, just their
    0 indices. So 3 > 1 so return 1. OK, 1. But what is "1" here?

    The documented value that -command should return when the a parameter
    is larger than the b parameter. It is right there in the manpage.

    Holy mackerew. I need to dissect this thing:

    proc compare {a b} {
    puts "begin: a is $a b is $b"
    set a0 [lindex $a 0]
    set b0 [lindex $b 0]
    if {$a0 < $b0} {
    puts "R1 $a0 < $b0, returning -1"
    return -1
    } elseif {$a0 > $b0} {
    puts "R2 $a0 > $b0, returning 1"
    return 1
    }
    puts "two tests failed with $a0 vs. $b0, going to the third"
    puts "R3 [lindex $a 1] [lindex $b 1], returning [string compare [lindex $a 1] [lindex $b 1]] for [lindex $a 1] and [lindex $b 1]"
    return [string compare [lindex $a 1] [lindex $b 1]]
    }
    puts [lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}]


    Output:

    begin: a is 3 apple b is 0x2 carrot
    R2 3 > 0x2, returning 1
    begin: a is 1 dingo b is 2 banana
    R1 1 < 2, returning -1
    begin: a is 0x2 carrot b is 1 dingo
    R2 0x2 > 1, returning 1
    begin: a is 0x2 carrot b is 2 banana
    two tests failed with 0x2 vs. 2, going to the third
    R3 carrot banana, returning 1 for carrot and banana
    {1 dingo} {2 banana} {0x2 carrot} {3 apple}

    So I see the first pair was compared to the second pair, then the third
    pair was compared to the fourth pair... OK...

    Then the second pair is compared to the fourth pair... Why?

    Because that's the order that lsort's algorithm requests that the
    comparisons be performed.

    And the results of all the comparisons were 1, -1, 1 and 1. OK.
    And what does that mean?

    1 signals to lsort that the first command parameter was larger than the
    second command parameter. -1 signals the opposite. 0 signals equal,
    although no equal values appeared here. The -1, 0, 1 is how the
    compare proc tells lsort which item it selected to be compared is
    larger (or if they are equal).

    Sorry, but I really can't understand the shuffle.

    It's not a shuffle, it's a sort (although sorts and shuffles are
    somewhat two sides of the same coin).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to Rich on Thu Dec 21 20:40:20 2023
    On Wed, 20 Dec 2023 02:41:42 -0000 (UTC), Rich wrote:

    It's not a shuffle, it's a sort (although sorts and shuffles are
    somewhat two sides of the same coin).
    **************************

    I guess I struggled to understand the concept because I had planted some
    idea in my head that was a lot more complicated. I continually failed
    to realize that it amounts to mere one on one comparisons and a very
    simple 1/-1 output. Now I get it and my code has shrunk significantly
    and it's working very well.

    I also learned that one pass of lsort "survives" a previous pass so I
    can sort the listing by size then again by directories versus files and
    have everything sorted by size but with directories first. Super neat.
    Once again, the way Tcl handles lists surprises me in a very good way.

    Thank you.


    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Fri Dec 22 02:44:44 2023
    Luc <[email protected]d> wrote:
    On Wed, 20 Dec 2023 02:41:42 -0000 (UTC), Rich wrote:

    It's not a shuffle, it's a sort (although sorts and shuffles are
    somewhat two sides of the same coin).
    **************************

    I guess I struggled to understand the concept because I had planted
    some idea in my head that was a lot more complicated.

    Basic sorts are not terribly complicated to understand. Where the
    complication arises is the algorithms that try to reduce the number of comparisons and/or amount of data movement. One of the simpler ones to understand (but not an efficient algorithm) is the bubble sort <https://en.wikipedia.org/wiki/Bubble_sort>.

    I continually failed to realize that it amounts to mere one on one comparisons and a very simple 1/-1 output.

    Yes, for sorts, the unit of comparison is pairs. The advanced
    algorithms just produce sorted output while minimizing the number of comparisons that need to be made.

    I also learned that one pass of lsort "survives" a previous pass so I
    can sort the listing by size then again by directories versus files
    and have everything sorted by size but with directories first. Super
    neat.

    Yes, that is what is meant by the word "stable" in the second sentence
    of the manpage on lsort. Although one needs to have been exposed to
    some CS sorting theory before one realizes what meaning that word is
    conveying to a reader about the algorithm.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)