Forum: >>> Magnum BBS <<<

I need to organize some data

From Luc@21:1/5 to All on Tue Dec 19 18:09:49 2023

Solving problems and even reinventing wheels can be fun, I've done that a
lot, but sometimes it becomes tiresome and may even become stupid
pig-headed. So I decided to ask for help again.

I need to collect the names of all files and directories in a path. We have glob for that.

OK. Now I need to organize that data. The problem is, there may be many
ways of organizing data. Talk about skinning cats.

Maybe the user wants it all sorted by file name, alphabetically. OK. The problem is, maybe the user wants directories listed first. Maybe the user
wants files listed first. That alone means three possible lists. Not to
mention reverse order. That's another three lists.

Then maybe the user wants it all sorted by date. OK. The problem is, maybe
the user wants newest first, maybe the user wants oldest first. That alone means two possible lists. And again, maybe the user wants directories
listed first. Maybe the user wants files listed first. Maybe the user wants files and directories mixed. Those combined make six possible lists. And an additional problem. I will certainly use the Unix epoch to sort dates, but
will display a nicely formatted date string to the user later, so whatever
is used in the sorting process cannot be used in the display step.

More or less the same happens with size. I have to run calculations on raw
byte sizes before I format them as KB, MB, GB etc. After all those
permutations I mentioned before.

I've done all that. It works. The problem is, 1. It's 150 lines of code.
More IFs than ChatGPT. It's big and awkward. 2. I did it in a way that
sorts lists with strides and assumes that filename is [lindex list 0], size
is [lindex list 1], and so on. That is going to break because I want to let
the user change the position of the columns. Maybe the user wants filename-size-date-permission-owner or maybe just filename-date-size or
maybe filename-owner-date. Who knows? So I need to be able to query the
data in random orders, according to the user's choice of column
presence and order.

So I thought maybe I can use a dict:
dict create biglist $filelist
and $filelist would be yet another dict arrangement or maybe an array with
each file and its corresponding attributes.

Then I would just have to parse that dict, formatting size and date on the
fly before displaying them.

The problem is, it's still a lot of IFs.

Another problem is, there is a very, very unexpected little obstacle in the way. You would never guess what it is.

I use an old kernel. And there is a bug in BTRFS. One of my files is listed twice. Every time. Same name, same size, same date, same inode. I talked
about it in the #btrfs IRC channel and they confirmed the bug. It's old,
but it exists in old kernels. If it's there, it has to be seen. I can't let
my application hide it. It's not my job to sweep file system bugs under the carpet. Which means that, counterintuitively, I must not treat filenames as unique. I must assume there may be duplicates, as weird as that sounds. And they must be shown as such. You know what, I think I am going to write code
to flag that kind of situation very visibly to the user because that's the
kind of meticulous sad soul that I am. So anyway, I can't index my dict or array structure around file names.

That is going to make the problem more difficult. In fact, I'm not sure
where to start.

Would you like to contribute any ideas? No need for code. Just broad ideas
on how to organize the data efficiently. With duplicate file names.

TIA

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Tue Dec 19 22:49:37 2023

Luc <[email protected]d> wrote:

Solving problems and even reinventing wheels can be fun, I've done that a lot, but sometimes it becomes tiresome and may even become stupid
pig-headed. So I decided to ask for help again.

I need to collect the names of all files and directories in a path. We have glob for that.

OK. Now I need to organize that data. The problem is, there may be many
ways of organizing data. Talk about skinning cats.

Maybe the user wants it all sorted by file name, alphabetically. OK. The problem is, maybe the user wants directories listed first. Maybe the user wants files listed first. That alone means three possible lists. Not to mention reverse order. That's another three lists.

Why would that mean six lists?

Create one list of lists, with each sublist being the columns of raw
data for the files that you want to use.

Then, create a set of procs that utilize the "-command" interface to
lsort. One proc sorts by name, one proc by date, one proc by whatever
else.

Then when the user changes the sort order, just resort the single
nested list using the appropriate "sort type" proc.

Then maybe the user wants it all sorted by date. OK. The problem is, maybe the user wants newest first, maybe the user wants oldest first. That alone means two possible lists. And again, maybe the user wants directories
listed first. Maybe the user wants files listed first. Maybe the user wants files and directories mixed. Those combined make six possible lists. And an additional problem. I will certainly use the Unix epoch to sort dates, but will display a nicely formatted date string to the user later, so whatever
is used in the sorting process cannot be used in the display step.

Internal storage format and external viewing format are two different
things, and do not (and often are not) the same format. The view is
derived from the internal data by formatting it as it needs to be
formatted for the view.

I've done all that. It works. The problem is, 1. It's 150 lines of code.
More IFs than ChatGPT. It's big and awkward. 2. I did it in a way that
sorts lists with strides and assumes that filename is [lindex list 0], size is [lindex list 1], and so on. That is going to break because I want to let the user change the position of the columns.

lsort has the -index option just for sorting a list of lists.

And again, the internal storage format is not the same as display format.

Internal storage format is the same, always. That keeps the code
simpler.

If the user swaps around columns then you display the columns in a
different order to create the view, but the column order in the internal
format does not change.

Another problem is, there is a very, very unexpected little obstacle in the way. You would never guess what it is.

I use an old kernel. And there is a bug in BTRFS. One of my files is listed twice. Every time. Same name, same size, same date, same inode. I talked about it in the #btrfs IRC channel and they confirmed the bug. It's old,
but it exists in old kernels. If it's there, it has to be seen. I can't let my application hide it. It's not my job to sweep file system bugs under the carpet. Which means that, counterintuitively, I must not treat filenames as unique. I must assume there may be duplicates, as weird as that sounds. And they must be shown as such. You know what, I think I am going to write code to flag that kind of situation very visibly to the user because that's the kind of meticulous sad soul that I am. So anyway, I can't index my dict or array structure around file names.

Nope, so just don't. Use a nested list of lists, which will store
duplicates with no problem.

Note that if you store the "filename" separated from the "path to the
file" then you have to be able to support duplicate filenames anyway,
even without a "bug", because two sub directories could have the
identical filename inside.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to Rich on Tue Dec 19 22:52:52 2023

On Tue, 19 Dec 2023 22:49:37 -0000 (UTC), Rich wrote:

Note that if you store the "filename" separated from the "path to the
file" then you have to be able to support duplicate filenames anyway,
even without a "bug", because two sub directories could have the
identical filename inside.

I don't really understand what you mean here because in my case there
won't be any subdirectories. Each "view" contemplates the flat listing
inside a directory. There is no recursive action here whatsoever.

There will be recursive actions when I get to the stage of copying,
moving and deleting, but I'm not dealing with any sharp edges yet.

Anyway, let's see the new problem I have now:

Create one list of lists, with each sublist being the columns of raw
data for the files that you want to use.

Then, create a set of procs that utilize the "-command" interface to
lsort. One proc sorts by name, one proc by date, one proc by whatever
else.

Then when the user changes the sort order, just resort the single
nested list using the appropriate "sort type" proc.

I never used the -command option before so I had to look it up.

Ugh! Why are so many examples in the manual so opaque?

% proc compare {a b} {
set a0 [lindex $a 0]
set b0 [lindex $b 0]
if {$a0 < $b0} {
return -1
} elseif {$a0 > $b0} {
return 1
}
return [string compare [lindex $a 1] [lindex $b 1]]
}

% lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}
{1 dingo} {2 banana} {0x2 carrot} {3 apple}

I copied and changed and ran that many times trying to understand what in
the world is going on here, but I can't. How does the comparison work?

Is {3 apple} compared to {0x2 carrot} then {1 dingo} compared to
{2 banana}?

Does it stride in leaps of 2 because two arguments were given (a and b)?

But that doesn't make sense. I can't find a hypothesis that makes sense.
I can't see the logic.

I guess {3 apple} is compared with {0x2 carrot}. Actually, just their
0 indices. So 3 > 1 so return 1. OK, 1. But what is "1" here?

Holy mackerew. I need to dissect this thing:

proc compare {a b} {
puts "begin: a is $a b is $b"
set a0 [lindex $a 0]
set b0 [lindex $b 0]
if {$a0 < $b0} {
puts "R1 $a0 < $b0, returning -1"
return -1
} elseif {$a0 > $b0} {
puts "R2 $a0 > $b0, returning 1"
return 1
}
puts "two tests failed with $a0 vs. $b0, going to the third"
puts "R3 [lindex $a 1] [lindex $b 1], returning [string compare [lindex $a 1] [lindex $b 1]] for [lindex $a 1] and [lindex $b 1]"
return [string compare [lindex $a 1] [lindex $b 1]]
}
puts [lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}]

Output:

begin: a is 3 apple b is 0x2 carrot
R2 3 > 0x2, returning 1
begin: a is 1 dingo b is 2 banana
R1 1 < 2, returning -1
begin: a is 0x2 carrot b is 1 dingo
R2 0x2 > 1, returning 1
begin: a is 0x2 carrot b is 2 banana
two tests failed with 0x2 vs. 2, going to the third
R3 carrot banana, returning 1 for carrot and banana
{1 dingo} {2 banana} {0x2 carrot} {3 apple}

So I see the first pair was compared to the second pair, then the third
pair was compared to the fourth pair... OK...

Then the second pair is compared to the fourth pair... Why?

And the results of all the comparisons were 1, -1, 1 and 1. OK.
And what does that mean?

Sorry, but I really can't understand the shuffle.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Wed Dec 20 02:41:42 2023

Luc <[email protected]d> wrote:

On Tue, 19 Dec 2023 22:49:37 -0000 (UTC), Rich wrote:

Note that if you store the "filename" separated from the "path to the
file" then you have to be able to support duplicate filenames anyway,
even without a "bug", because two sub directories could have the
identical filename inside.

I don't really understand what you mean here because in my case there
won't be any subdirectories. Each "view" contemplates the flat listing
inside a directory. There is no recursive action here whatsoever.

I had no idea whether you were building a recursive view or not. But
if you /were/ building one, and splitting the name away from the rest
of the path, then with recursion you can (more like 'will') have
identical "names" (just on different paths).

Create one list of lists, with each sublist being the columns of raw
data for the files that you want to use.

Then, create a set of procs that utilize the "-command" interface to
lsort. One proc sorts by name, one proc by date, one proc by whatever >>else.

Then when the user changes the sort order, just resort the single
nested list using the appropriate "sort type" proc.

I never used the -command option before so I had to look it up.

Ugh! Why are so many examples in the manual so opaque?

% proc compare {a b} {
set a0 [lindex $a 0]
set b0 [lindex $b 0]
if {$a0 < $b0} {
return -1
} elseif {$a0 > $b0} {
return 1
}
return [string compare [lindex $a 1] [lindex $b 1]]
}

% lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}
{1 dingo} {2 banana} {0x2 carrot} {3 apple}

There's nothing opaque above.

I copied and changed and ran that many times trying to understand what in
the world is going on here, but I can't. How does the comparison work?

Note, I don't know what exact algorithm lsort uses, but in a general
sense, what lsort does is it traverses the list, comparing pairs of
list elements as it goes, and decides whether to swap them or not,
based upon the return value of the '-command' proc. This may not match reality, but you should get the idea.

In the example, the 'list' is more easily viewed if formatted this way
(with list index values on the left):

{
0: {3 apple}
1: {0x2 carrot}
2: {1 dingo}
3: {2 banana}
}

Is {3 apple} compared to {0x2 carrot} then {1 dingo} compared to
{2 banana}?

No, 2 is compared to 0x2. The list is four elements, each a string
that is also a valid list. As lsort's algorithm chooses pairs to
compare, it calls -command, and appending two list elements as
parameters to the command.

Inside the compare proc:

% proc compare {a b} {
set a0 [lindex $a 0]
set b0 [lindex $b 0]
if {$a0 < $b0} {
return -1
} elseif {$a0 > $b0} {
return 1
}
return [string compare [lindex $a 1] [lindex $b 1]]
}

We have a parameter variable a and parameter variable b. The first
thing compare does is extract the zero'th element of a and of b (so the
3 in "3 apple" or the 1 in {1 dingo}. Then it checks if a0 is smaller
than b0, if yes, it returns -1 (if you read the lsort man page you'll
see that the expected return values from the -command is negative if a
is smaller than b, zero of a equals b, and positive if a is larger than
b. So the first if clause checks for a smaller than b and returns -1
if so. Then it checks for a larger than b, returning 1 if that is the
case. All that is left is a equal to b, and for that case it checks
the second element of a and b (the word) as a tie breaker and returns
-1, 0, or 1 again, this time using string compare.

Does it stride in leaps of 2 because two arguments were given (a and b)?

lsort calls compare with two elements, which two depend upon what exact
sort algorithm it is that lsort uses.

I guess {3 apple} is compared with {0x2 carrot}. Actually, just their
0 indices. So 3 > 1 so return 1. OK, 1. But what is "1" here?

The documented value that -command should return when the a parameter
is larger than the b parameter. It is right there in the manpage.

Holy mackerew. I need to dissect this thing:

proc compare {a b} {
puts "begin: a is $a b is $b"
set a0 [lindex $a 0]
set b0 [lindex $b 0]
if {$a0 < $b0} {
puts "R1 $a0 < $b0, returning -1"
return -1
} elseif {$a0 > $b0} {
puts "R2 $a0 > $b0, returning 1"
return 1
}
puts "two tests failed with $a0 vs. $b0, going to the third"
puts "R3 [lindex $a 1] [lindex $b 1], returning [string compare [lindex $a 1] [lindex $b 1]] for [lindex $a 1] and [lindex $b 1]"
return [string compare [lindex $a 1] [lindex $b 1]]
}
puts [lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}]

Output:

begin: a is 3 apple b is 0x2 carrot
R2 3 > 0x2, returning 1
begin: a is 1 dingo b is 2 banana
R1 1 < 2, returning -1
begin: a is 0x2 carrot b is 1 dingo
R2 0x2 > 1, returning 1
begin: a is 0x2 carrot b is 2 banana
two tests failed with 0x2 vs. 2, going to the third
R3 carrot banana, returning 1 for carrot and banana
{1 dingo} {2 banana} {0x2 carrot} {3 apple}

So I see the first pair was compared to the second pair, then the third
pair was compared to the fourth pair... OK...

Then the second pair is compared to the fourth pair... Why?

Because that's the order that lsort's algorithm requests that the
comparisons be performed.

And the results of all the comparisons were 1, -1, 1 and 1. OK.
And what does that mean?

1 signals to lsort that the first command parameter was larger than the
second command parameter. -1 signals the opposite. 0 signals equal,
although no equal values appeared here. The -1, 0, 1 is how the
compare proc tells lsort which item it selected to be compared is
larger (or if they are equal).

Sorry, but I really can't understand the shuffle.

It's not a shuffle, it's a sort (although sorts and shuffles are
somewhat two sides of the same coin).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to Rich on Thu Dec 21 20:40:20 2023

On Wed, 20 Dec 2023 02:41:42 -0000 (UTC), Rich wrote:

It's not a shuffle, it's a sort (although sorts and shuffles are
somewhat two sides of the same coin).

**************************

I guess I struggled to understand the concept because I had planted some
idea in my head that was a lot more complicated. I continually failed
to realize that it amounts to mere one on one comparisons and a very
simple 1/-1 output. Now I get it and my code has shrunk significantly
and it's working very well.

I also learned that one pass of lsort "survives" a previous pass so I
can sort the listing by size then again by directories versus files and
have everything sorted by size but with directories first. Super neat.
Once again, the way Tcl handles lists surprises me in a very good way.

Thank you.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Fri Dec 22 02:44:44 2023

Luc <[email protected]d> wrote:

On Wed, 20 Dec 2023 02:41:42 -0000 (UTC), Rich wrote:

It's not a shuffle, it's a sort (although sorts and shuffles are
somewhat two sides of the same coin).

**************************

I guess I struggled to understand the concept because I had planted
some idea in my head that was a lot more complicated.

Basic sorts are not terribly complicated to understand. Where the
complication arises is the algorithms that try to reduce the number of comparisons and/or amount of data movement. One of the simpler ones to understand (but not an efficient algorithm) is the bubble sort <https://en.wikipedia.org/wiki/Bubble_sort>.

I continually failed to realize that it amounts to mere one on one comparisons and a very simple 1/-1 output.

Yes, for sorts, the unit of comparison is pairs. The advanced
algorithms just produce sorted output while minimizing the number of comparisons that need to be made.

I also learned that one pass of lsort "survives" a previous pass so I
can sort the listing by size then again by directories versus files
and have everything sorted by size but with directories first. Super
neat.

Yes, that is what is meant by the word "stable" in the second sentence
of the manpage on lsort. Although one needs to have been exposed to
some CS sorting theory before one realizes what meaning that word is
conveying to a reader about the algorithm.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Bob Worm
  Tue Jul 28 16:01:18 2026
  from Wales, Uk via Telnet
- Rixter
  Tue Jul 28 13:42:46 2026
  from Madison, Nc via Telnet
- Krenn
  Tue Jul 28 11:59:57 2026
  from Sydney, Nsw via Telnet
- Rixter
  Tue Jul 28 01:23:48 2026
  from Madison, Nc via Telnet
- Centurion
  Mon Jul 27 22:50:42 2026
  from Berea, Ohio via Telnet
- Ataricrypt
  Mon Jul 27 19:19:17 2026
  from England via Telnet
- Bob Worm
  Mon Jul 27 15:19:55 2026
  from Wales, Uk via Telnet
- Rixter
  Mon Jul 27 13:04:59 2026
  from Madison, Nc via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	45:24:12
Calls:	12,444
Calls today:	4
Files:	15,192
Messages:	6,537,093

I need to organize some data

Who's Online

Recent Visitors

System Info