Forum: >>> Magnum BBS <<<

Can Tcl scan faster than find?

From Luc@21:1/5 to All on Thu Dec 8 19:01:18 2022

I have this application that is divided between a shell script
and a Tcl script.

The shell script uses `find' to scan the entire hard disk and output
the full path of every single file into a catalog file. It has to be
run from time to time to update the catalog.

The Tcl script has a very quick'n'dirty GUI that accepts a string
as input, finds matches in the catalog and shows all the matches,
with the matched string highlighted.

It's a very old application of mine that I want to improve.

The first version of it did everything in one Tcl script, but
I remember when I replaced the Tcl proc with a shell script to scan
the hard disk because `find' was a lot faster than my Tcl code.

Of course, maybe my code was bad, but it was just a matter of going
into every directory found and globbing it. There wasn't a lot of
opportunity for screwing up.
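
A minimal sketch of the "go into every directory and glob it" approach described above. This is not Luc's original code, which is not shown in the thread; the proc name `scanTree` is hypothetical. It assumes Unix-style dot-files (hence the extra `.*` pattern) and does not follow symbolic links.

```tcl
# Hypothetical helper: walk a directory tree with [glob], using an explicit
# work list instead of recursion so deep trees cannot exhaust the stack.
proc scanTree {root} {
    set result {}
    set queue [list $root]
    while {[llength $queue] > 0} {
        # Take the next directory off the end of the work list.
        set dir [lindex $queue end]
        set queue [lrange $queue 0 end-1]
        # "*" skips dot-files on Unix, so ".*" is globbed as well;
        # lsort -unique guards against a name matching both patterns.
        foreach path [lsort -unique [glob -nocomplain -directory $dir -- * .*]] {
            set tail [file tail $path]
            if {$tail eq "." || $tail eq ".."} continue
            # [file type] uses lstat, so a symlink is reported as "link"
            # and is never descended into. Entries that vanish are skipped.
            if {[catch {file type $path} type]} continue
            if {$type eq "directory"} {
                lappend queue $path
            } else {
                lappend result $path
            }
        }
    }
    return $result
}
```

Calling `scanTree /some/dir` returns a flat list of full paths, one per non-directory entry. Each entry costs one `glob` result plus one `file type` call, which is probably where a script-level scan spends its time compared with `find`.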

Anyway, my question is, do you think it's possible to write Tcl code
that can rival `find' in speed?

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ralf Fassel@21:1/5 to All on Fri Dec 9 11:54:37 2022

* Luc <[email protected]>
| Anyway, my question is, do you think it's possible to write Tcl code
| that can rival `find' in speed?

There is the fileutil package in tcllib:

https://core.tcl-lang.org/tcllib/doc/trunk/embedded/md/tcllib/files/modules/fileutil/fileutil.md

which contains

::fileutil::find ?basedir ?filtercmd??

An implementation of the unix command find. Adapted from the Tcler's
Wiki. Takes at most two arguments, the path to the directory to start
searching from and a command to use to evaluate interest in each
file. [...]

Maybe give it a try? Note that the command returns only after all files
have been found, so for a 'live' application you would start it in a
separate thread and communicate the files via the filtercmd to the main
thread (or play around with 'update' in the filtercmd).
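
A small sketch of the 'update' variant mentioned above, assuming tcllib is installed. The names `wantFile`, `catalog` and `::scanned` are made up for illustration. The filter is called once per file or directory with an unqualified name, while the working directory is set to the entry's parent.

```tcl
package require fileutil   ;# from tcllib

# Counter used only to decide when to service the event loop.
set ::scanned 0

# Filter command for fileutil::find. It receives the unqualified name of the
# current entry and returns true if the entry should be part of the result.
proc wantFile {name} {
    # Let pending events (e.g. GUI redraws) run every 500 entries.
    if {[incr ::scanned] % 500 == 0} { update }
    return [file isfile $name]
}

# Return the full paths of all regular files below root.
proc catalog {root} {
    return [fileutil::find $root wantFile]
}
```

Calling `update` from inside a callback can re-enter other event handlers, so a real application would want to guard against starting a second scan while one is running.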

Somehow I doubt that a script based solution will be faster than one
in C (though the disk IO should be the limiting factor here).

R'

From Rich@21:1/5 to Ralf Fassel on Fri Dec 9 16:04:29 2022

Ralf Fassel <[email protected]> wrote:
> * Luc <[email protected]>
> | Anyway, my question is, do you think it's possible to write Tcl code
> | that can rival `find' in speed?
>
> Somehow I doubt that a script based solution will be faster than one
> in C (though the disk IO should be the limiting factor here).

Agreed. I also doubt a TCL variant will be faster than the
/usr/bin/find utility for identical scans.

And disk IO, esp. if using mechanical disks where seek times dominate
for "scan a directory hierarchy" runs, is going to be the ultimate
limiting factor. This fact will likely be what would make it appear
that a TCL and a /usr/bin/find scan were close in time -- both spent a
majority (as in 98%+) of their runtime waiting for disk head seeks to
complete.

Running on an SSD would remove the seek time overhead, and likely
result in /usr/bin/find surpassing a TCL solution by a substantial
margin.

From Luc@21:1/5 to Rich on Fri Dec 9 17:36:52 2022

On Fri, 9 Dec 2022 16:04:29 -0000 (UTC), Rich wrote:
> And disk IO, esp. if using mechanical disks where seek times dominate
> for "scan a directory hierarchy" runs, is going to be the ultimate
> limiting factor. This fact will likely be what would make it appear
> that a TCL and a /usr/bin/find scan were close in time -- both spent a
> majority (as in 98%+) of their runtime waiting for disk head seeks to
> complete.
>
> Running on an SSD would remove the seek time overhead, and likely
> result in /usr/bin/find surpassing a TCL solution by a substantial
> margin.

The disk I/O bottleneck is not very relevant because I am not as concerned
with how long it's going to take as I am with how much LONGER than `find'
it's going to take.

I intend to release the end product as an application so it's not just for
me, and people are expected to understand that scanning the entire HD is
going to take some time. The core of the issue here is whether it's still
worth trying to do everything in Tcl or I should just accept the facts of
life and do some [exec find] thing.

I'm also considering the option of collecting additional data on every
file such as size, date and permissions, up to the user. For that I would
feel a lot more comfortable using pure Tcl. The current code has none of
that but it occurs to me that some people may want it.
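
A minimal sketch of collecting size, date and permissions in pure Tcl, using only the core `file lstat` command. The proc name `fileInfo` and the dict layout are assumptions made for illustration, not part of the application discussed here.

```tcl
# Hypothetical helper: return a dict describing one file system entry.
proc fileInfo {path} {
    # lstat describes a symbolic link itself rather than its target.
    file lstat $path st
    return [dict create \
        path  $path \
        size  $st(size) \
        mtime $st(mtime) \
        mode  [format %04o [expr {$st(mode) & 0o7777}]] \
        type  $st(type)]
}
```

The `mtime` value is in seconds since the epoch and can be passed to `clock format`. On Windows the permission bits are synthesized by Tcl, so `mode` is mainly meaningful on Unix-like systems.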

So yeah, I guess I have to run some tests on that ::fileutil:: command
and see how well it performs against my Tcl code and `find'.
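
One possible way to run such a comparison is a small timing helper like the sketch below. The name `bestOf` is hypothetical. It reports the fastest of several runs, so after the first run it mostly measures scanning with a warm disk cache rather than cold disk I/O.

```tcl
# Hypothetical helper: run a script several times in the caller's scope and
# return the fastest elapsed time in microseconds.
proc bestOf {runs script} {
    set best ""
    for {set i 0} {$i < $runs} {incr i} {
        set t0 [clock microseconds]
        uplevel 1 $script
        set dt [expr {[clock microseconds] - $t0}]
        if {$best eq "" || $dt < $best} { set best $dt }
    }
    return $best
}

# Example use on a Unix-like system (paths are placeholders):
#   set findUs [bestOf 3 {catch {exec find /usr > /dev/null 2> /dev/null}}]
#   set tclUs  [bestOf 3 {fileutil::find /usr}]
#   puts "find: $findUs us, Tcl: $tclUs us"
```

Timing a cold cache would need an OS-specific way to drop the cache between runs, which is outside what this sketch covers.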

Thank you all.

--
Luc

From Rich@21:1/5 to Luc on Fri Dec 9 20:55:53 2022

Luc <[email protected]> wrote:
> On Fri, 9 Dec 2022 16:04:29 -0000 (UTC), Rich wrote:
>> And disk IO, esp. if using mechanical disks where seek times dominate
>> for "scan a directory hierarchy" runs, is going to be the ultimate
>> limiting factor. This fact will likely be what would make it appear
>> that a TCL and a /usr/bin/find scan were close in time -- both spent a
>> majority (as in 98%+) of their runtime waiting for disk head seeks to
>> complete.
>>
>> Running on an SSD would remove the seek time overhead, and likely
>> result in /usr/bin/find surpassing a TCL solution by a substantial
>> margin.
>
> The disk I/O bottleneck is not very relevant because I am not as concerned
> with how long it's going to take as I am with how much LONGER than `find'
> it's going to take.

If you want to quantify "how much longer" then your only option may be
to run tests. About all any of us can say without actually testing is
"TCL is likely to be slower".

> I intend to release the end product as an application so it's not just for
> me, and people are expected to understand that scanning the entire HD is
> going to take some time. The core of the issue here is whether it's still
> worth trying to do everything in Tcl or I should just accept the facts of
> life and do some [exec find] thing.

Do you plan to make the end product be cross platform (i.e., run on
Linux, Windows, and Mac)? If yes, then you'd want to write it all in
Tcl, even if slower, because there is no equivalent to 'find' on win
(at least not in the default MS install) and while there is one on Mac,
the BSD vs. GNU differences might make for the need for two different
process loops.

> I'm also considering the option of collecting additional data on every
> file such as size, date and permissions, up to the user. For that I would
> feel a lot more comfortable using pure Tcl. The current code has none of
> that but it occurs to me that some people may want it.

GNU find has the ability to output much of this with its "-printf"
option, which might make find even faster than TCL -- but then you /do/
still have to parse the output in TCL, possibly negating the
difference. But that option to find may not exist on Mac, and there is
no 'find' on windows by default.
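
A rough sketch of what parsing such output could look like, assuming GNU find and a tab-separated format of size (`%s`), modification time (`%T@`), octal mode (`%m`) and path (`%p`). The proc names and the chosen format string are assumptions for illustration. File names containing newlines would break this line-based approach.

```tcl
# Hypothetical parser for one line produced by:
#   find ROOT -printf '%s\t%T@\t%m\t%p\n'
# Returns a dict, or an empty string if the line does not match.
proc parseFindLine {line} {
    # Only the first three tabs separate fields; the path keeps any others.
    if {![regexp {^(\d+)\t([\d.]+)\t([0-7]+)\t(.*)$} $line -> size mtime mode path]} {
        return {}
    }
    # %T@ prints seconds with a fractional part; keep the whole seconds.
    set mtime [lindex [split $mtime .] 0]
    return [dict create path $path size $size mtime $mtime mode $mode]
}

# Hypothetical reader: run GNU find and collect one dict per entry.
proc readFindCatalog {root} {
    set fp [open |[list find $root -printf {%s\t%T@\t%m\t%p\n} 2>/dev/null] r]
    set records {}
    while {[gets $fp line] >= 0} {
        set rec [parseFindLine $line]
        if {$rec ne ""} { lappend records $rec }
    }
    # find exits non-zero on e.g. permission errors, which makes close fail.
    catch {close $fp}
    return $records
}
```

Whether the time saved by letting find gather the metadata outweighs the per-line `regexp` cost is something only a measurement on real data can settle.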

From Robert Heller@21:1/5 to [email protected] on Fri Dec 9 21:51:12 2022

At Fri, 9 Dec 2022 17:36:52 -0300 Luc <[email protected]> wrote:
> On Fri, 9 Dec 2022 16:04:29 -0000 (UTC), Rich wrote:
>> And disk IO, esp. if using mechanical disks where seek times dominate
>> for "scan a directory hierarchy" runs, is going to be the ultimate
>> limiting factor. This fact will likely be what would make it appear
>> that a TCL and a /usr/bin/find scan were close in time -- both spent a
>> majority (as in 98%+) of their runtime waiting for disk head seeks to
>> complete.
>>
>> Running on an SSD would remove the seek time overhead, and likely
>> result in /usr/bin/find surpassing a TCL solution by a substantial
>> margin.
>
> The disk I/O bottleneck is not very relevant because I am not as concerned
> with how long it's going to take as I am with how much LONGER than `find'
> it's going to take.
>
> I intend to release the end product as an application so it's not just for
> me, and people are expected to understand that scanning the entire HD is
> going to take some time. The core of the issue here is whether it's still
> worth trying to do everything in Tcl or I should just accept the facts of
> life and do some [exec find] thing.

More likely:

set fp [open "|find ..." r];# replace '...' with find's params and opts

fileevent $fp readable [list processfile $fp]

## called by the event loop each time find's output becomes readable
proc processfile {fp} {
    if {[gets $fp pathname] >= 0} {
        # process pathname (e.g. using "file <command> $pathname ..." as desired)
    } else {
        # end of find's output: close the pipe and finish
        catch {close $fp}
        exit; # or whatever
    }
}

vwait forever;# don't forget this at the end (if Tk is not in play).

> I'm also considering the option of collecting additional data on every
> file such as size, date and permissions, up to the user. For that I would
> feel a lot more comfortable using pure Tcl. The current code has none of
> that but it occurs to me that some people may want it.
>
> So yeah, I guess I have to run some tests on that ::fileutil:: command
> and see how well it performs against my Tcl code and `find'.
>
> Thank you all.

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
[email protected] -- Webhosting Services

From briang@21:1/5 to Luc on Sat Dec 10 15:05:54 2022

On Thursday, December 8, 2022 at 2:01:23 PM UTC-8, Luc wrote:
> I have this application that is divided between a shell script
> and a Tcl script.
>
> The shell script uses `find' to scan the entire hard disk and output
> the full path of every single file into a catalog file. It has to be
> run from time to time to update the catalog.
>
> The Tcl script has a very quick'n'dirty GUI that accepts a string
> as input, finds matches in the catalog and shows all the matches,
> with the matched string highlighted.
>
> It's a very old application of mine that I want to improve.
>
> The first version of it did everything in one Tcl script, but
> I remember when I replaced the Tcl proc with a shell script to scan
> the hard disk because `find' was a lot faster than my Tcl code.
>
> Of course, maybe my code was bad, but it was just a matter of going
> into every directory found and globbing it. There wasn't a lot of
> opportunity for screwing up.
>
> Anyway, my question is, do you think it's possible to write Tcl code
> that can rival `find' in speed?
>
> --
> Luc

I doubt you'll be able to best the speed of find. I have written a
utility in Tcl that scans the entire hard drive. I used a threaded model
to try and take advantage of I/O latency, since it also gathers file size
info. My assumption is that the OS will optimize its operations and
suspend the thread(s) until the data is ready. I have not timed it or
compared to "find", but it is able to scan ~0.5TB fast enough for me.
It's not quick, nor does it take "forever." It also runs on all
platforms.

It scans the starting dir for files and subdirectories, and farms the
subdirectories out to another thread from a pool. The thread jobs get
queued as worker threads become available. This is done recursively. The
results in the worker thread are queued back to the main thread via a
non-blocking callback command, making the worker thread quickly available
for the next job.

-Brian
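
A simplified sketch of the thread-pool idea, assuming the Thread extension (which provides the `tpool` commands) is available. This is not Brian's utility: his design sends results back through a non-blocking callback, while this version has the main thread collect job results with `tpool::wait`. The names `makeScanPool`, `scanDir` and `poolScan` are hypothetical.

```tcl
package require Thread

# Create a pool whose workers each know how to list a single directory.
proc makeScanPool {{workers 4}} {
    return [tpool::create -minworkers 1 -maxworkers $workers -initcmd {
        # Runs inside a worker: return {files subdirs} for one directory.
        proc scanDir {dir} {
            set files {}
            set dirs {}
            foreach p [lsort -unique [glob -nocomplain -directory $dir -- * .*]] {
                if {[file tail $p] in {. ..}} continue
                # lstat-based, so symlinks are listed but not followed.
                if {[catch {file type $p} type]} continue
                if {$type eq "directory"} {
                    lappend dirs $p
                } else {
                    lappend files $p
                }
            }
            return [list $files $dirs]
        }
    }]
}

# Main thread: post one job per directory and collect results as they finish.
proc poolScan {pool root} {
    set all {}
    set pending [list [tpool::post $pool [list scanDir $root]]]
    while {[llength $pending] > 0} {
        # Blocks until at least one job is done; the rest stay in pending.
        set done [tpool::wait $pool $pending pending]
        foreach job $done {
            lassign [tpool::get $pool $job] files dirs
            lappend all {*}$files
            foreach d $dirs {
                lappend pending [tpool::post $pool [list scanDir $d]]
            }
        }
    }
    return $all
}
```

A typical call sequence would be `set pool [makeScanPool 4]`, then `poolScan $pool /some/dir`, then `tpool::release $pool`. Whether several workers actually help depends on how well the OS and the drive overlap the outstanding directory reads.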

From Luc@21:1/5 to briang on Sat Dec 10 20:30:02 2022

On Sat, 10 Dec 2022 15:05:54 -0800 (PST), briang wrote:
> I doubt you'll be able to best the speed of find. I have written a
> utility in Tcl that scans the entire hard drive. I used a threaded model
> to try and take advantage of I/O latency, since it also gathers file size
> info. My assumption is that the OS will optimize its operations and
> suspend the thread(s) until the data is ready. I have not timed it or
> compared to "find", but it is able to scan ~0.5TB fast enough for me.
> It's not quick, nor does it take "forever." It also runs on all
> platforms.
>
> It scans the starting dir for files and subdirectories, and farms the
> subdirectories out to another thread from a pool. The thread jobs get
> queued as worker threads become available. This is done recursively. The
> results in the worker thread are queued back to the main thread via a
> non-blocking callback command, making the worker thread quickly available
> for the next job.
>
> -Brian

Interesting, but I wonder how effective that concept of threads really is.
The CPU may support multiple threads, but does the hard disk?

--
Luc

From briang@21:1/5 to Luc on Sat Dec 10 16:00:32 2022

On Saturday, December 10, 2022 at 3:30:08 PM UTC-8, Luc wrote:
> On Sat, 10 Dec 2022 15:05:54 -0800 (PST), briang wrote:
>> I doubt you'll be able to best the speed of find. I have written a
>> utility in Tcl that scans the entire hard drive. I used a threaded model
>> to try and take advantage of I/O latency, since it also gathers file size
>> info. My assumption is that the OS will optimize its operations and
>> suspend the thread(s) until the data is ready. I have not timed it or
>> compared to "find", but it is able to scan ~0.5TB fast enough for me.
>> It's not quick, nor does it take "forever." It also runs on all
>> platforms.
>>
>> It scans the starting dir for files and subdirectories, and farms the
>> subdirectories out to another thread from a pool. The thread jobs get
>> queued as worker threads become available. This is done recursively. The
>> results in the worker thread are queued back to the main thread via a
>> non-blocking callback command, making the worker thread quickly available
>> for the next job.
>>
>> -Brian
>
> Interesting, but I wonder how effective that concept of threads really is.
> The CPU may support multiple threads, but does the hard disk?
>
> --
> Luc

Yes, they do.

-Brian

From Rich@21:1/5 to Luc on Sun Dec 11 02:47:20 2022

Luc <[email protected]> wrote:
> On Sat, 10 Dec 2022 15:05:54 -0800 (PST), briang wrote:
>> I doubt you'll be able to best the speed of find. I have written a
>> utility in Tcl that scans the entire hard drive. I used a threaded model
>> to try and take advantage of I/O latency, since it also gathers file size
>> info. My assumption is that the OS will optimize its operations and
>> suspend the thread(s) until the data is ready. I have not timed it or
>> compared to "find", but it is able to scan ~0.5TB fast enough for me.
>> It's not quick, nor does it take "forever." It also runs on all
>> platforms.
>>
>> It scans the starting dir for files and subdirectories, and farms the
>> subdirectories out to another thread from a pool. The thread jobs get
>> queued as worker threads become available. This is done recursively. The
>> results in the worker thread are queued back to the main thread via a
>> non-blocking callback command, making the worker thread quickly available
>> for the next job.
>>
>> -Brian
>
> Interesting, but I wonder how effective that concept of threads really is.
> The CPU may support multiple threads, but does the hard disk?

Yes. Look up Native Command Queuing:
https://en.wikipedia.org/wiki/NCQ

For a mechanical drive, there is only one head arm, so ultimately the
"threads" serialize on that fact, but the drive can readjust ordering
to minimize head arm seeks.

For an SSD, since there is no head arm, there is no head arm seek
time, and depending upon the internal flash memory design, the
'threads' could possibly perform parallel reads from different areas of
the flash.

From Robert Heller@21:1/5 to Rich on Sun Dec 11 05:10:18 2022

At Sun, 11 Dec 2022 02:47:20 -0000 (UTC) Rich <[email protected]> wrote:
> Luc <[email protected]> wrote:
>> On Sat, 10 Dec 2022 15:05:54 -0800 (PST), briang wrote:
>>> I doubt you'll be able to best the speed of find. I have written a
>>> utility in Tcl that scans the entire hard drive. I used a threaded model
>>> to try and take advantage of I/O latency, since it also gathers file size
>>> info. My assumption is that the OS will optimize its operations and
>>> suspend the thread(s) until the data is ready. I have not timed it or
>>> compared to "find", but it is able to scan ~0.5TB fast enough for me.
>>> It's not quick, nor does it take "forever." It also runs on all
>>> platforms.
>>>
>>> It scans the starting dir for files and subdirectories, and farms the
>>> subdirectories out to another thread from a pool. The thread jobs get
>>> queued as worker threads become available. This is done recursively. The
>>> results in the worker thread are queued back to the main thread via a
>>> non-blocking callback command, making the worker thread quickly available
>>> for the next job.
>>>
>>> -Brian
>>
>> Interesting, but I wonder how effective that concept of threads really is.
>> The CPU may support multiple threads, but does the hard disk?
>
> Yes. Look up Native Command Queuing:
> https://en.wikipedia.org/wiki/NCQ
>
> For a mechanical drive, there is only one head arm, so ultimately the
> "threads" serialize on that fact, but the drive can readjust ordering
> to minimize head arm seeks.
>
> For a SSD drive, since there is no head arm, there is no head arm seek
> time, and depending upon the internal flash memory design, the
> 'threads' could possibly perform parallel reads from different areas of
> the flash.

I would expect that at the application level, disk I/O might not be tied
*directly* to physical "disk" I/O, but rather be accessing the RAM-based
disk cache buffers. That is, the *kernel* might be reading large parts of
the disk (whole tracks) into RAM buffers. Depending on how the data is
laid out on the "disk", it *might* be possible to effectively access
multiple parts of the disk "concurrently" with different threads.

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
[email protected] -- Webhosting Services

From blacksqr@21:1/5 to Luc on Mon Dec 19 09:04:41 2022

On Thursday, December 8, 2022 at 4:01:23 PM UTC-6, Luc wrote:
> Anyway, my question is, do you think it's possible to write Tcl code
> that can rival `find' in speed?
>
> --
> Luc

I wrote a Tcl program called globfind a while back
(https://wiki.tcl-lang.org/page/globfind) which I tried to optimize for
speed in searches of large filesystem spaces. I got a performance
improvement of about three times over Tcllib's fileutil::find, but it's
still slower than GNU find. A large pattern-match search using globfind
requires about 150% of the time GNU find takes.
