• Can Tcl scan faster than find?

    From Luc@21:1/5 to All on Thu Dec 8 19:01:18 2022
    I have this application that is divided between a shell script
    and a Tcl script.

    The shell script uses `find' to scan the entire hard disk and output
    the full path of every single file into a catalog file. It has to be
    run from time to time to update the catalog.

    The Tcl script has a very quick'n'dirty GUI that accepts a string
    as input, finds matches in the catalog and shows all the matches,
    with the matched string highlighted.

    It's a very old application of mine that I want to improve.

    The first version of it did everything in one Tcl script, but
    I remember when I replaced the Tcl proc with a shell script to scan
    the hard disk because `find' was a lot faster than my Tcl code.

    Of course, maybe my code was bad, but it was just a matter of going
    into every directory found and globbing it. There wasn't a lot of
    opportunity for screwing up.
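
    For illustration, the approach described above (enter every directory,
    glob it, recurse) might look roughly like the sketch below. It is not
    the original code: the proc name globScan is hypothetical, and the
    choice not to follow symbolic links is an assumption made here to
    avoid loops.

    ```tcl
    # Hypothetical helper: return the paths of all regular files below $root.
    # Symbolic links are neither followed nor listed ([file type] reports
    # them as "link").
    proc globScan {root} {
        set result {}
        set stack [list $root]
        while {[llength $stack] > 0} {
            # Pop the next directory to visit.
            set dir [lindex $stack end]
            set stack [lrange $stack 0 end-1]
            # "*" misses dot files, so ".*" is globbed as well.
            if {[catch {glob -nocomplain -directory $dir -- * .*} entries]} {
                continue
            }
            foreach path $entries {
                # ".*" may also match "." and ".."; skip them.
                if {[file tail $path] in {. ..}} continue
                if {[catch {file type $path} type]} continue
                switch -- $type {
                    directory { lappend stack $path }
                    file      { lappend result $path }
                }
            }
        }
        return $result
    }
    ```

    Calling [globScan /some/dir] returns a plain Tcl list, which could then
    be written to the catalog file one path per line.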

    Anyway, my question is, do you think it's possible to write Tcl code
    that can rival `find' in speed?

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ralf Fassel@21:1/5 to All on Fri Dec 9 11:54:37 2022
    * Luc <[email protected]>
    | Anyway, my question is, do you think it's possible to write Tcl code
    | that can rival `find' in speed?

    There is the fileutil package in tcllib:

    https://core.tcl-lang.org/tcllib/doc/trunk/embedded/md/tcllib/files/modules/fileutil/fileutil.md

    which contains

    ::fileutil::find ?basedir ?filtercmd??

    An implementation of the unix command find. Adapted from the Tcler's
    Wiki. Takes at most two arguments, the path to the directory to start
    searching from and a command to use to evaluate interest in each
    file. [...]

    Maybe give it a try? Note that the command returns only after all files
    have been found, so for a 'live' application you would start it in a
    separate thread and communicate the files via the filtercmd to the main
    thread (or play around with 'update' in the filtercmd).
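
    A minimal sketch of calling it, assuming tcllib is installed. The
    proc names isRegularFile and listRegularFiles are hypothetical. As far
    as I understand the fileutil documentation, the filter receives the
    unqualified name of the current entry while the working directory is
    the one containing it, which is what the filter below relies on.

    ```tcl
    package require fileutil

    # Hypothetical filter: keep regular files only. The name passed in is
    # relative to the current working directory (assumption, see above).
    proc isRegularFile {name} {
        return [file isfile $name]
    }

    # Hypothetical wrapper: list every regular file below $root.
    proc listRegularFiles {root} {
        return [fileutil::find $root isRegularFile]
    }
    ```

    For example, [llength [listRegularFiles /tmp]] gives the number of
    regular files found below /tmp.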

    Somehow I doubt that a script based solution will be faster than one
    in C (though the disk IO should be the limiting factor here).

    R'

  • From Rich@21:1/5 to Ralf Fassel on Fri Dec 9 16:04:29 2022
    Ralf Fassel <[email protected]> wrote:
    > * Luc <[email protected]>
    > | Anyway, my question is, do you think it's possible to write Tcl code
    > | that can rival `find' in speed?
    >
    > Somehow I doubt that a script based solution will be faster than one
    > in C (though the disk IO should be the limiting factor here).

    Agreed. I also doubt a TCL variant will be faster than the
    /usr/bin/find utility for identical scans.

    And disk IO, esp. if using mechanical disks where seek times dominate
    for "scan a directory hierarchy" runs, is going to be the ultimate
    limiting factor. This fact will likely be what would make it appear
    that a TCL and a /usr/bin/find scan were close in time -- both spent a
    majority (as in 98%+) of their runtime waiting for disk head seeks to
    complete.

    Running on an SSD would remove the seek time overhead, and likely
    result in /usr/bin/find surpassing a TCL solution by a substantial
    margin.

  • From Luc@21:1/5 to Rich on Fri Dec 9 17:36:52 2022
    On Fri, 9 Dec 2022 16:04:29 -0000 (UTC), Rich wrote:

    > Ralf Fassel <[email protected]> wrote:
    >
    > And disk IO, esp. if using mechanical disks where seek times dominate
    > for "scan a directory hierarchy" runs, is going to be the ultimate
    > limiting factor. This fact will likely be what would make it appear
    > that a TCL and a /usr/bin/find scan were close in time -- both spent a
    > majority (as in 98%+) of their runtime waiting for disk head seeks to
    > complete.
    >
    > Running on an SSD would remove the seek time overhead, and likely
    > result in /usr/bin/find surpassing a TCL solution by a substantial
    > margin.


    The disk I/O bottleneck is not very relevant because I am not as concerned
    with how long it's going to take as I am with how much LONGER than `find'
    it's going to take.

    I intend to release the end product as an application so it's not just for
    me, and people are expected to understand that scanning the entire HD is
    going to take some time. The core of the issue here is whether it's still
    worth trying to do everything in Tcl or I should just accept the facts of
    life and do some [exec find] thing.

    I'm also considering the option of collecting additional data on every
    file such as size, date and permissions, up to the user. For that I would
    feel a lot more comfortable using pure Tcl. The current code has none of
    that but it occurs to me that some people may want it.

    So yeah, I guess I have to run some tests on that ::fileutil:: command
    and see how well it performs against my Tcl code and `find'.

    Thank you all.


    --
    Luc


  • From Rich@21:1/5 to Luc on Fri Dec 9 20:55:53 2022
    Luc <[email protected]> wrote:
    > On Fri, 9 Dec 2022 16:04:29 -0000 (UTC), Rich wrote:
    >
    >> Ralf Fassel <[email protected]> wrote:
    >>
    >> And disk IO, esp. if using mechanical disks where seek times dominate
    >> for "scan a directory hierarchy" runs, is going to be the ultimate
    >> limiting factor. This fact will likely be what would make it appear
    >> that a TCL and a /usr/bin/find scan were close in time -- both spent a
    >> majority (as in 98%+) of their runtime waiting for disk head seeks to
    >> complete.
    >>
    >> Running on an SSD would remove the seek time overhead, and likely
    >> result in /usr/bin/find surpassing a TCL solution by a substantial
    >> margin.
    >
    > The disk I/O bottleneck is not very relevant because I am not as concerned
    > with how long it's going to take as I am with how much LONGER than `find'
    > it's going to take.

    If you want to quantify "how much longer" then your only option may be
    to run tests. About all any of us can say without actually testing is
    "TCL is likely to be slower".

    > I intend to release the end product as an application so it's not just for
    > me, and people are expected to understand that scanning the entire HD is
    > going to take some time. The core of the issue here is whether it's still
    > worth trying to do everything in Tcl or I should just accept the facts of
    > life and do some [exec find] thing.

    Do you plan to make the end product be cross platform (i.e., run on
    Linux, Windows, and Mac)? If yes, then you'd want to write it all in
    Tcl, even if slower, because there is no equivalent to 'find' on win
    (at least not in the default MS install) and while there is one on Mac,
    the BSD vs. GNU differences might make for the need for two different
    process loops.

    > I'm also considering the option of collecting additional data on every
    > file such as size, date and permissions, up to the user. For that I would
    > feel a lot more comfortable using pure Tcl. The current code has none of
    > that but it occurs to me that some people may want it.

    GNU find has the ability to output much of this with its "-printf"
    option, which might make find even faster than TCL -- but then you /do/
    still have to parse the output in TCL, possibly negating the
    difference. But that option to find may not exist on Mac, and there is
    no 'find' on windows by default.
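
    As a rough sketch of what that parsing could look like, assuming GNU
    find is on the PATH: the format string uses GNU find's -printf
    directives, while findFormat, parseFindLine and findWithStats are
    hypothetical names. Paths containing newlines would break the
    line-based reading.

    ```tcl
    # Assumption: GNU find (BSD/macOS find has no -printf).
    # %s = size in bytes, %T@ = mtime in seconds since the epoch,
    # %m = permission bits in octal, %p = path. find itself expands \t, \n.
    set findFormat {%s\t%T@\t%m\t%p\n}

    # Hypothetical helper: split one tab-separated output line into a dict.
    proc parseFindLine {line} {
        set fields [split $line \t]
        lassign $fields size mtime mode
        # Re-join the remainder in case the path itself contains tabs.
        set path [join [lrange $fields 3 end] \t]
        return [dict create size $size mtime $mtime mode $mode path $path]
    }

    # Hypothetical helper: run find below $root, return one dict per file.
    proc findWithStats {root} {
        set fp [open |[list find $root -type f -printf $::findFormat] r]
        set records {}
        while {[gets $fp line] >= 0} {
            lappend records [parseFindLine $line]
        }
        # close raises an error if find wrote to stderr or exited non-zero
        # (e.g. on unreadable directories), so it is caught here.
        catch {close $fp}
        return $records
    }
    ```

    For example, [dict get [lindex [findWithStats .] 0] size] would give
    the size of the first file reported.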

  • From Robert Heller@21:1/5 to [email protected] on Fri Dec 9 21:51:12 2022
    At Fri, 9 Dec 2022 17:36:52 -0300 Luc <[email protected]> wrote:


    > On Fri, 9 Dec 2022 16:04:29 -0000 (UTC), Rich wrote:
    >
    >> Ralf Fassel <[email protected]> wrote:
    >>
    >> And disk IO, esp. if using mechanical disks where seek times dominate
    >> for "scan a directory hierarchy" runs, is going to be the ultimate
    >> limiting factor. This fact will likely be what would make it appear
    >> that a TCL and a /usr/bin/find scan were close in time -- both spent a
    >> majority (as in 98%+) of their runtime waiting for disk head seeks to
    >> complete.
    >>
    >> Running on an SSD would remove the seek time overhead, and likely
    >> result in /usr/bin/find surpassing a TCL solution by a substantial
    >> margin.
    >
    > The disk I/O bottleneck is not very relevant because I am not as concerned
    > with how long it's going to take as I am with how much LONGER than `find'
    > it's going to take.
    >
    > I intend to release the end product as an application so it's not just for
    > me, and people are expected to understand that scanning the entire HD is
    > going to take some time. The core of the issue here is whether it's still
    > worth trying to do everything in Tcl or I should just accept the facts of
    > life and do some [exec find] thing.

    More likely:

    set fp [open "|find ..." r]    ;# replace '...' with find's params and opts

    fileevent $fp readable [list processfile $fp]

    ## called as
    proc processfile {fp} {
        if {[gets $fp pathname] >= 0} {
            # process pathname (e.g. using "file <command> $pathname ..." as desired)
        } else {
            catch {close $fp}
            exit    ;# or whatever
        }
    }

    vwait forever    ;# don't forget this at the end (if Tk is not in play).


    > I'm also considering the option of collecting additional data on every
    > file such as size, date and permissions, up to the user. For that I would
    > feel a lot more comfortable using pure Tcl. The current code has none of
    > that but it occurs to me that some people may want it.
    >
    > So yeah, I guess I have to run some tests on that ::fileutil:: command
    > and see how well it performs against my Tcl code and `find'.
    >
    > Thank you all.



    --
    Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
    Deepwoods Software -- Custom Software Services
    http://www.deepsoft.com/ -- Linux Administration Services
    [email protected] -- Webhosting Services

  • From briang@21:1/5 to Luc on Sat Dec 10 15:05:54 2022
    On Thursday, December 8, 2022 at 2:01:23 PM UTC-8, Luc wrote:
    > I have this application that is divided between a shell script
    > and a Tcl script.
    >
    > The shell script uses `find' to scan the entire hard disk and output
    > the full path of every single file into a catalog file. It has to be
    > run from time to time to update the catalog.
    >
    > The Tcl script has a very quick'n'dirty GUI that accepts a string
    > as input, finds matches in the catalog and shows all the matches,
    > with the matched string highlighted.
    >
    > It's a very old application of mine that I want to improve.
    >
    > The first version of it did everything in one Tcl script, but
    > I remember when I replaced the Tcl proc with a shell script to scan
    > the hard disk because `find' was a lot faster than my Tcl code.
    >
    > Of course, maybe my code was bad, but it was just a matter of going
    > into every directory found and globbing it. There wasn't a lot of
    > opportunity for screwing up.
    >
    > Anyway, my question is, do you think it's possible to write Tcl code
    > that can rival `find' in speed?
    >
    > --
    > Luc


    I doubt you'll be able to best the speed of find. I have written a
    utility in Tcl that scans the entire hard drive. I used a threaded model
    to try and take advantage of I/O latency, since it also gathers file size
    info. My assumption is that the OS will optimize its operations and
    suspend the thread(s) until the data is ready. I have not timed it or
    compared to "find", but it is able to scan ~0.5TB fast enough for me.
    It's not quick, nor does it take "forever." It also runs on all
    platforms.

    It scans the starting dir for files and subdirectories, and farms the
    subdirectories out to another thread from a pool. The thread jobs get
    queued as worker threads become available. This is done recursively. The
    results in the worker thread are queued back to the main thread via a
    non-blocking callback command, making the worker thread quickly available
    for the next job.
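
    One way such a design could be sketched with the Thread package's
    tpool commands is shown below. This is not the utility described
    above: it collects results by waiting on job ids rather than through
    an asynchronous callback, the names workerInit, scanDir and poolScan
    are hypothetical, and the default of four workers is arbitrary.

    ```tcl
    package require Thread

    # Script evaluated once in every worker thread of the pool.
    set workerInit {
        # List one directory: returns {files subdirs}. Symbolic links are
        # skipped ([file type] reports them as "link").
        proc scanDir {dir} {
            set files {}
            set subdirs {}
            if {[catch {glob -nocomplain -directory $dir -- * .*} entries]} {
                return [list {} {}]
            }
            foreach path $entries {
                if {[file tail $path] in {. ..}} continue
                if {[catch {file type $path} type]} continue
                switch -- $type {
                    directory { lappend subdirs $path }
                    file      { lappend files $path }
                }
            }
            return [list $files $subdirs]
        }
    }

    # Hypothetical driver: scan $root using up to $workers pool threads.
    proc poolScan {root {workers 4}} {
        set pool [tpool::create -maxworkers $workers -initcmd $::workerInit]
        set pending [list [tpool::post $pool [list scanDir $root]]]
        set all {}
        while {[llength $pending] > 0} {
            # Wait for at least one job; the rest stay in "pending".
            set done [tpool::wait $pool $pending pending]
            foreach job $done {
                lassign [tpool::get $pool $job] files subdirs
                lappend all {*}$files
                # Every subdirectory found becomes a new job.
                foreach d $subdirs {
                    lappend pending [tpool::post $pool [list scanDir $d]]
                }
            }
        }
        tpool::release $pool
        return $all
    }
    ```

    Each job lists a single directory, so the main thread only merges
    results and posts new jobs while the workers wait on the disk.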

    -Brian

  • From Luc@21:1/5 to briang on Sat Dec 10 20:30:02 2022
    On Sat, 10 Dec 2022 15:05:54 -0800 (PST), briang wrote:

    > I doubt you'll be able to best the speed of find. I have written a
    > utility in Tcl that scans the entire hard drive. I used a threaded model
    > to try and take advantage of I/O latency, since it also gathers file size
    > info. My assumption is that the OS will optimize its operations and
    > suspend the thread(s) until the data is ready. I have not timed it or
    > compared to "find", but it is able to scan ~0.5TB fast enough for me.
    > It's not quick, nor does it take "forever." It also runs on all
    > platforms.
    >
    > It scans the starting dir for files and subdirectories, and farms the
    > subdirectories out to another thread from a pool. The thread jobs get
    > queued as worker threads become available. This is done recursively. The
    > results in the worker thread are queued back to the main thread via a
    > non-blocking callback command, making the worker thread quickly available
    > for the next job.
    >
    > -Brian

    Interesting, but I wonder how effective that concept of threads really is.
    The CPU may support multiple threads, but does the hard disk?

    --
    Luc


  • From briang@21:1/5 to Luc on Sat Dec 10 16:00:32 2022
    On Saturday, December 10, 2022 at 3:30:08 PM UTC-8, Luc wrote:
    > On Sat, 10 Dec 2022 15:05:54 -0800 (PST), briang wrote:
    >
    >> I doubt you'll be able to best the speed of find. I have written a
    >> utility in Tcl that scans the entire hard drive. I used a threaded model
    >> to try and take advantage of I/O latency, since it also gathers file size
    >> info. My assumption is that the OS will optimize its operations and
    >> suspend the thread(s) until the data is ready. I have not timed it or
    >> compared to "find", but it is able to scan ~0.5TB fast enough for me.
    >> It's not quick, nor does it take "forever." It also runs on all
    >> platforms.
    >>
    >> It scans the starting dir for files and subdirectories, and farms the
    >> subdirectories out to another thread from a pool. The thread jobs get
    >> queued as worker threads become available. This is done recursively. The
    >> results in the worker thread are queued back to the main thread via a
    >> non-blocking callback command, making the worker thread quickly available
    >> for the next job.
    >>
    >> -Brian
    >
    > Interesting, but I wonder how effective that concept of threads really is.
    > The CPU may support multiple threads, but does the hard disk?
    >
    > --
    > Luc

    Yes, they do.

    -Brian

  • From Rich@21:1/5 to Luc on Sun Dec 11 02:47:20 2022
    Luc <[email protected]> wrote:
    > On Sat, 10 Dec 2022 15:05:54 -0800 (PST), briang wrote:
    >
    >> I doubt you'll be able to best the speed of find. I have written a
    >> utility in Tcl that scans the entire hard drive. I used a threaded model
    >> to try and take advantage of I/O latency, since it also gathers file size
    >> info. My assumption is that the OS will optimize its operations and
    >> suspend the thread(s) until the data is ready. I have not timed it or
    >> compared to "find", but it is able to scan ~0.5TB fast enough for me.
    >> It's not quick, nor does it take "forever." It also runs on all
    >> platforms.
    >>
    >> It scans the starting dir for files and subdirectories, and farms the
    >> subdirectories out to another thread from a pool. The thread jobs get
    >> queued as worker threads become available. This is done recursively. The
    >> results in the worker thread are queued back to the main thread via a
    >> non-blocking callback command, making the worker thread quickly available
    >> for the next job.
    >>
    >> -Brian
    >
    > Interesting, but I wonder how effective that concept of threads really is.
    > The CPU may support multiple threads, but does the hard disk?

    Yes. Look up Native Command Queuing:
    https://en.wikipedia.org/wiki/NCQ

    For a mechanical drive, there is only one head arm, so ultimately the
    "threads" serialize on that fact, but the drive can readjust ordering
    to minimize head arm seeks.

    For a SSD drive, since there is no head arm, there is no head arm seek
    time, and depending upon the internal flash memory design, the
    'threads' could possibly perform parallel reads from different areas of
    the flash.

  • From Robert Heller@21:1/5 to Rich on Sun Dec 11 05:10:18 2022
    At Sun, 11 Dec 2022 02:47:20 -0000 (UTC) Rich <[email protected]d> wrote:


    > Luc <[email protected]> wrote:
    >> On Sat, 10 Dec 2022 15:05:54 -0800 (PST), briang wrote:
    >>
    >>> I doubt you'll be able to best the speed of find. I have written a
    >>> utility in Tcl that scans the entire hard drive. I used a threaded model
    >>> to try and take advantage of I/O latency, since it also gathers file size
    >>> info. My assumption is that the OS will optimize its operations and
    >>> suspend the thread(s) until the data is ready. I have not timed it or
    >>> compared to "find", but it is able to scan ~0.5TB fast enough for me.
    >>> It's not quick, nor does it take "forever." It also runs on all
    >>> platforms.
    >>>
    >>> It scans the starting dir for files and subdirectories, and farms the
    >>> subdirectories out to another thread from a pool. The thread jobs get
    >>> queued as worker threads become available. This is done recursively. The
    >>> results in the worker thread are queued back to the main thread via a
    >>> non-blocking callback command, making the worker thread quickly available
    >>> for the next job.
    >>>
    >>> -Brian
    >>
    >> Interesting, but I wonder how effective that concept of threads really is.
    >> The CPU may support multiple threads, but does the hard disk?
    >
    > Yes. Look up Native Command Queuing:
    > https://en.wikipedia.org/wiki/NCQ
    >
    > For a mechanical drive, there is only one head arm, so ultimately the
    > "threads" serialize on that fact, but the drive can readjust ordering
    > to minimize head arm seeks.
    >
    > For a SSD drive, since there is no head arm, there is no head arm seek
    > time, and depending upon the internal flash memory design, the
    > 'threads' could possibly perform parallel reads from different areas of
    > the flash.

    I would expect that at the application level, disk I/O might not be tied
    *directly* to physical "disk" I/O, but rather be accessing the RAM-based
    disk cache buffers. That is, the *kernel* might be reading large parts of
    the disk (whole tracks) into RAM buffers. Depending on how the data is
    laid out on the "disk", it *might* be possible to effectively access
    multiple parts of the disk "concurrently" with different threads.

    --
    Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
    Deepwoods Software -- Custom Software Services
    http://www.deepsoft.com/ -- Linux Administration Services
    [email protected] -- Webhosting Services

  • From blacksqr@21:1/5 to Luc on Mon Dec 19 09:04:41 2022
    On Thursday, December 8, 2022 at 4:01:23 PM UTC-6, Luc wrote:
    > Anyway, my question is, do you think it's possible to write Tcl code
    > that can rival `find' in speed?
    >
    > --
    > Luc


    I wrote a Tcl program called globfind a while back
    (https://wiki.tcl-lang.org/page/globfind) which I tried to optimize for
    speed in searches of large filesystem spaces. I got a performance
    improvement of about three times over Tcllib's fileutil::find, but it's
    still slower than GNU find. A large pattern-match search using globfind
    requires about 150% of the time GNU find takes.
