* Luc <[email protected]>
| Anyway, my question is, do you think it's possible to write Tcl code
| that can rival `find' in speed?
Somehow I doubt that a script based solution will be faster than one
in C (though the disk IO should be the limiting factor here).
Ralf Fassel <[email protected]> wrote:
And disk IO, esp. if using mechanical disks where seek times dominate
for "scan a directory hierarchy" runs, is going to be the ultimate
limiting factor. This fact will likely be what would make it appear
that a TCL and a /usr/bin/find scan were close in time -- both spent a majority (as in 98%+) of their runtime waiting for disk head seeks to complete.
Running on an SSD would remove the seek time overhead, and likely
result in /usr/bin/find surpassing a TCL solution by a substantial
margin.
On Fri, 9 Dec 2022 16:04:29 -0000 (UTC), Rich wrote:
Ralf Fassel <[email protected]> wrote:
And disk IO, esp. if using mechanical disks where seek times dominate
for "scan a directory hierarchy" runs, is going to be the ultimate
limiting factor. This fact will likely be what would make it appear
that a TCL and a /usr/bin/find scan were close in time -- both spent a
majority (as in 98%+) of their runtime waiting for disk head seeks to
complete.
Running on an SSD would remove the seek time overhead, and likely
result in /usr/bin/find surpassing a TCL solution by a substantial
margin.
The disk I/O bottleneck is not very relevant, because I am not as concerned with how long it's going to take as I am with how much LONGER than `find' it's going to take.
I intend to release the end product as an application, so it's not just for me, and people are expected to understand that scanning the entire HD is going to take some time. The core of the issue here is whether it's still worth trying to do everything in Tcl or whether I should just accept the facts of life and do some [exec find] thing.
I'm also considering the option of collecting additional data on every
file, such as size, date and permissions, at the user's choice. For that I would feel a lot more comfortable using pure Tcl. The current code has none of
that, but it occurs to me that some people may want it.
So yeah, I guess I have to run some tests on that ::fileutil:: command
and see how well it performs against my Tcl code and `find'.
Thank you all.
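For the pure-Tcl route with optional per-file data, a minimal sketch might look like the following. The proc name `scanTreeWithStat` and the record layout are invented for this illustration and are not taken from the application discussed in the thread; it assumes Tcl 8.5 or later and has not been benchmarked against `find'.

```tcl
# Walk a directory tree iteratively and return one record per
# non-directory entry: {path size mtime permissions}.
# Hypothetical helper for illustration only.
proc scanTreeWithStat {root} {
    set records {}
    set pending [list $root]
    while {[llength $pending] > 0} {
        # Pop the most recently queued directory (depth-first order).
        set dir [lindex $pending end]
        set pending [lrange $pending 0 end-1]

        # "*" misses dotfiles on Unix, so ask for ".*" as well.
        # catch guards against directories that cannot be read.
        if {[catch {glob -nocomplain -directory $dir -- * .*} entries]} {
            continue
        }
        foreach path $entries {
            # ".*" can also match "." and "..", which must be skipped.
            if {[file tail $path] in {. ..}} continue
            # lstat does not follow symlinks, so a link to a directory
            # is recorded as an entry and never descended into.
            if {[catch {file lstat $path st}]} continue
            if {$st(type) eq "directory"} {
                lappend pending $path
            } else {
                lappend records [list $path $st(size) $st(mtime) \
                    [format %04o [expr {$st(mode) & 0xFFF}]]]
            }
        }
    }
    return $records
}
```

Each record can be unpacked with `lassign $rec path size mtime perms`. One `file lstat` call per entry yields the type, size, date and mode together, so collecting the extra data should not require a second pass over the tree.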
I have this application that is divided between a shell script
and a Tcl script.
The shell script uses `find' to scan the entire hard disk and output
the full path of every single file into a catalog file. It has to be
run from time to time to update the catalog.
The Tcl script has a very quick'n'dirty GUI that accepts a string
as input, finds matches in the catalog and shows all the matches,
with the matched string highlighted.
It's a very old application of mine that I want to improve.
The first version of it did everything in one Tcl script, but
I remember when I replaced the Tcl proc with a shell script to scan
the hard disk because `find' was a lot faster than my Tcl code.
Of course, maybe my code was bad, but it was just a matter of going
into every directory found and globbing it. There wasn't a lot of opportunity for screwing up.
Anyway, my question is, do you think it's possible to write Tcl code
that can rival `find' in speed?
--
Luc
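One way to put numbers on the question is to time both approaches on the same tree. The sketch below is only an assumption about how such a test could be set up: `globWalk` and `compareScans` are hypothetical names, the walker is the plain "enter every directory and glob it" approach described above, and the comparison assumes a Unix-like system with `find' on the PATH and file names without embedded newlines.

```tcl
# Recursive walker: glob each directory, descend into real directories.
# Hypothetical helper; requires Tcl 8.5+ for {*} and the "in" operator.
proc globWalk {dir} {
    set files {}
    foreach path [glob -nocomplain -directory $dir -- * .*] {
        if {[file tail $path] in {. ..}} continue
        # "file type" uses lstat, so symlinked directories are not followed.
        if {[catch {file type $path} type]} continue
        if {$type eq "directory"} {
            lappend files {*}[globWalk $path]
        } else {
            lappend files $path
        }
    }
    return $files
}

# Time the Tcl walker against find(1) on the same root.
# Note: exec raises an error if find exits with a non-zero status.
proc compareScans {root} {
    # Warm the OS cache first so neither run pays for cold disk reads.
    globWalk $root
    set tclUs [lindex [time {set a [globWalk $root]}] 0]
    set findUs [lindex [time {
        set b [split [exec find $root ! -type d 2>/dev/null] \n]
    }] 0]
    return [dict create tcl_us $tclUs find_us $findUs \
        tcl_files [llength $a] find_files [llength $b]]
}
```

Calling `compareScans /some/dir` returns a dict with both timings in microseconds and both file counts. The counts are a sanity check that the two scans saw the same tree. Warming the cache first means the result mostly reflects CPU and system-call overhead rather than disk seeks.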
I doubt you'll be able to best the speed of find. I have written a
utility in Tcl that scans the entire hard drive. I used a threaded model
to try and take advantage of I/O latency, since it also gathers file size info. My assumption is that the OS will optimize its operations and
suspend the thread(s) until the data is ready. I have not timed it or compared to "find", but it is able to scan ~0.5TB fast enough for me.
It's not quick, nor does it take "forever." It also runs on all
platforms.
It scans the starting dir for files and subdirectories, and farms the subdirectories out to another thread from a pool. The thread jobs get
queued as worker threads become available. This is done recursively. The results in the worker thread are queued back to the main thread via a non-blocking callback command, making the worker thread quickly available
for the next job.
-Brian
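The pool idea described above could be sketched roughly as follows, using the `tpool` commands from the Tcl Thread package. This is a simplified variant and not Brian's actual code: here the main thread posts one job per directory and collects results with `tpool::wait`, instead of workers posting sub-jobs and calling back asynchronously. The names `listDir`, `workerInit` and `threadedScan` are hypothetical.

```tcl
package require Thread

# Script evaluated once in every worker thread; defines the per-directory job.
set workerInit {
    proc listDir {dir} {
        set files {}
        set subdirs {}
        if {[catch {glob -nocomplain -directory $dir -- * .*} entries]} {
            return [list $files $subdirs]
        }
        foreach path $entries {
            if {[file tail $path] in {. ..}} continue
            # lstat: symlinks are reported as links and never descended into.
            if {[catch {file lstat $path st}]} continue
            if {$st(type) eq "directory"} {
                lappend subdirs $path
            } else {
                lappend files [list $path $st(size)]
            }
        }
        return [list $files $subdirs]
    }
}

# Scan a tree with a pool of worker threads; returns {path size} pairs.
proc threadedScan {root {workers 4}} {
    global workerInit
    set pool [tpool::create -maxworkers $workers -initcmd $workerInit]
    set results {}
    # tpool::post hands the job to an idle worker, starting new workers
    # until the pool limit is reached, and returns a job id.
    set pending [list [tpool::post $pool [list listDir $root]]]
    while {[llength $pending] > 0} {
        # Block until at least one job completes; jobs that are still
        # running are written back into the pending variable.
        set done [tpool::wait $pool $pending pending]
        foreach job $done {
            lassign [tpool::get $pool $job] files subdirs
            lappend results {*}$files
            foreach d $subdirs {
                lappend pending [tpool::post $pool [list listDir $d]]
            }
        }
    }
    tpool::release $pool
    return $results
}
```

Whether the extra threads pay off depends on the storage and the OS, as discussed in the replies, so the worker count is worth measuring rather than guessing.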
On Sat, 10 Dec 2022 15:05:54 -0800 (PST), briang wrote:
I doubt you'll be able to best the speed of find. I have written a
utility in Tcl that scans the entire hard drive. I used a threaded model
to try and take advantage of I/O latency, since it also gathers file size info. My assumption is that the OS will optimize its operations and
suspend the thread(s) until the data is ready. I have not timed it or compared to "find", but it is able to scan ~0.5TB fast enough for me.
It's not quick, nor does it take "forever." It also runs on all
platforms.
It scans the starting dir for files and subdirectories, and farms the subdirectories out to another thread from a pool. The thread jobs get queued as worker threads become available. This is done recursively. The results in the worker thread are queued back to the main thread via a non-blocking callback command, making the worker thread quickly available for the next job.
-Brian
Interesting, but I wonder how effective that concept of threads really is. The CPU may support multiple threads, but does the hard disk?
--
Luc
Luc <[email protected]> wrote:
Interesting, but I wonder how effective that concept of threads really is. The CPU may support multiple threads, but does the hard disk?
Yes. Look up Native Command Queuing:
https://en.wikipedia.org/wiki/NCQ
For a mechanical drive, there is only one head arm, so ultimately the
"threads" serialize on that fact, but the drive can reorder the queued
requests to minimize head arm seeks.
For an SSD, since there is no head arm, there is no head arm seek
time, and depending upon the internal flash memory design, the
'threads' could possibly perform parallel reads from different areas of
the flash.