• Retrieving data from a thread

    From Luc@21:1/5 to All on Mon Dec 11 18:34:16 2023
    Hi.

    My application used to take 7 seconds to load. A little long.

    Now I am trying to implement some kind of spell checker so I just added
    code to load and process a file.

    set ::WordlistFile /home/tcl/longlist.txt
    set ::Wordlist ""
    set _fp [open $::WordlistFile r]
    while {![eof $_fp]} {
    set _line [string trim [gets $_fp]]
    if {$_line == ""} {continue}
    lappend ::Wordlist $_line
    }
    close $_fp

    Now it takes 10 seconds to load. Grumble.

    I thought about using threads to load the word list and make it all
    load faster.

    In fact, it already loads another list. Maybe I could "threadify" both?

    Anyway,

    I've been reading about threads and I must say the existing documentation
    is not very easy to understand. Code examples on google are pretty scarce
    too. I found an interesting one on StackOverflow where Donal (DKF)
    suggests using pools, but I couldn't make that work. I mean, I have to
    "get" data with tpool::get? But when? How do I know when the job is done?
    He also suggests tsv, but I found the relevant documentation hard to read
    and understand.

    Yet I've made prototypes that almost worked. Almost.

    The last mile I need, I think, is retrieving data from the thread. More specifically, ::Wordlist after it is built.

    I wish I didn't have to call it explicitly. Just let the thread build
    and set ::Wordlist whenever it feels ready. My goal is to let the thread
    load the list while the rest of the application loads other things.
    Of course, it must not take longer than a few seconds. It has to be done
    when I begin to type.

    Among so many attempts, I only came up with one that worked. But
    the application wouldn't load any faster. I must have done something
    wrong. Can you please enlighten me?

    package require Thread
    set thr [thread::create]
    thread::send $thr "set wordlistfile $::WordlistFile"
    thread::send -async $thr {
    set _fp [open $wordlistfile r]
    while {![eof $_fp]} {
    set _line [string trim [gets $_fp]]
    if {$_line == ""} {continue}
    lappend ::Wordlist $_line
    }
    close $_fp
    }
    puts "Now what?"


    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Luc on Mon Dec 11 16:24:36 2023
    On 12/11/2023 1:34 PM, Luc wrote:
    Hi.

    My application used to take 7 seconds to load. A little long.

    Now I am trying to implement some kind of spell checker so I just added
    code to load and process a file.

    set ::WordlistFile /home/tcl/longlist.txt
    set ::Wordlist ""
    set _fp [open $::WordlistFile r]
    while {![eof $_fp]} {
    set _line [string trim [gets $_fp]]
    if {$_line == ""} {continue}
    lappend ::Wordlist $_line
    }
    close $_fp

    Now it takes 10 seconds to load. Grumble.

    I thought about using threads to load the word list and make it all
    load faster.

    In fact, it already loads another list. Maybe I could "threadify" both?

    Anyway,

    I've been reading about threads and I must say the existing documentation
    is not very easy to understand. Code examples on google are pretty scarce too. I found an interesting one on StackOverflow where Donal (DKF)
    suggests using pools, but I couldn't make that work. I mean, I have to
    "get" data with tpool::get? But when? How do I know when the job is done?
    He also suggests tsv, but I found the relevant documentation hard to read
    and understand.

    Yet I've made prototypes that almost worked. Almost.

    The last mile I need, I think, is retrieving data from the thread. More specifically, ::Wordlist after it is built.

    I wish I didn't have to call it explicitly. Just let the thread build
    and set ::Wordlist whenever it feels ready. My goal is to let the thread
    load the list while the rest of the application loads other things.
    Of course, it must not take longer than a few seconds. It has to be done
    when I begin to type.

    Among so many attempts, I only came up with one that worked. But
    the application wouldn't load any faster. I must have done something
    wrong. Can you please enlighten me?

    package require Thread
    set thr [thread::create]
    thread::send $thr "set wordlistfile $::WordlistFile"
    thread::send -async $thr {
    set _fp [open $wordlistfile r]
    while {![eof $_fp]} {
    set _line [string trim [gets $_fp]]
    if {$_line == ""} {continue}
    lappend ::Wordlist $_line
    }
    close $_fp
    }
    puts "Now what?"



    What is /home/tcl/longlist.txt

    Are these the words to lookup or your spelling dictionary. I hate to assume. Based on your variable names, I can't tell which it is.

    But... I have a 400k word dictionary I got online somewhere. It's one word per line. No spaces, nothing to trim and no blank lines to remove. When I timed your code against a simpler,

    set _fp [open $::WordlistFile r]
    set data [read -nonewline $_fp]
    set Wordlist [split $data \n]
    close $_fp

    It went from 700ms to 50ms to create Wordlist. But again, not knowing for sure what that file really is....

    For example, if it is your spelling dictionary, I would preprocess it so the trim and blank line tests wouldn't be needed, and then I would store it in a tcl array as a hash table where an [info exist dictionary($someword)] could be used to check
    spelling.

    As to threads, I would *highly* recommend you get a copy of Ashok's book, The tcl programming language. It has a very good section on threads (and many other topics).

    See here: https://www.magicsplat.com/

    I purchased both the paper book and the pdf version (great for searching).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to All on Mon Dec 11 22:56:10 2023
    On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:

    What is /home/tcl/longlist.txt

    Are these the words to lookup or your spelling dictionary. I hate to
    assume. Based on your variable names, I can't tell which it is.

    But... I have a 400k word dictionary I got online somewhere. It's one word >per line. No spaces, nothing to trim and no blank lines to remove. When I >timed your code against a simpler,

    set _fp [open $::WordlistFile r]
    set data [read -nonewline $_fp]
    set Wordlist [split $data \n]
    close $_fp

    It went from 700ms to 50ms to create Wordlist. But again, not knowing for >sure what that file really is....

    For example, if it is your spelling dictionary, I would preprocess it so
    the trim and blank line tests wouldn't be needed, and then I would store
    it in a tcl array as a hash table where an [info exist
    dictionary($someword)] could be used to check spelling.

    As to threads, I would *highly* recommend you get a copy of Ashok's book,
    The tcl programming language. It has a very good section on threads (and
    many other topics).

    See here: https://www.magicsplat.com/

    I purchased both the paper book and the pdf version (great for searching).

    **************************


    My wordlist file has 1,248,300 lines. Each line is a word, yes.

    I am currently using lsearch -nocase for the lookups. Do you know for
    a fact that searching an array is faster than searching a list?
    I'm not confident enough in my own methods to measure these things.

    You see, case will matter if I use [info exists dictionary($someword)]. Handling case in that scenario will also add overhead.

    (I just realized I will have to split my word list in two, common words
    and proper names, because proper names must not be all lowercase.)

    But the lookup is no problem. That is fast enough. I spell check every
    word in a sentence (maximum 80 characters) in one fell swoop and it
    never feels slow at all.

    No corrections or any kind of guessing though, just checking whether
    the words exist or not. The correction suggestion part is currently
    in the R&D stage. Well, just R, no D yet.

    The bottleneck is definitely in loading the word list.

    I did a test without trim and found that the words are not found and a
    false misspell is flagged for everything. I soon realized that the
    newlines become part of each word so the upshot is they invalidate the
    entire dataset. I really have to axe them.

    But I will try your file reading code and see if it's faster.

    Either way, I would like the opportunity to learn about threads.

    I really can't afford any book right now. I don't want to go into
    details, suffice to say that I am in very very bad financial condition
    right now. Like, really, no joke.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to All on Mon Dec 11 23:35:45 2023
    On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:

    But... I have a 400k word dictionary I got online somewhere. It's one word >per line. No spaces, nothing to trim and no blank lines to remove. When I >timed your code against a simpler,

    set _fp [open $::WordlistFile r]
    set data [read -nonewline $_fp]
    set Wordlist [split $data \n]
    close $_fp
    **************************


    Well, I can say this, your code is many times as fast as mine.
    Very noticeable difference.

    I will be using that approach from now on.

    Thank you.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Luc on Mon Dec 11 22:10:08 2023
    On 12/11/2023 5:56 PM, Luc wrote:
    On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:

    What is /home/tcl/longlist.txt

    Are these the words to lookup or your spelling dictionary. I hate to
    assume. Based on your variable names, I can't tell which it is.

    But... I have a 400k word dictionary I got online somewhere. It's one word >> per line. No spaces, nothing to trim and no blank lines to remove. When I
    timed your code against a simpler,

    set _fp [open $::WordlistFile r]
    set data [read -nonewline $_fp]
    set Wordlist [split $data \n]
    close $_fp

    It went from 700ms to 50ms to create Wordlist. But again, not knowing for
    sure what that file really is....

    For example, if it is your spelling dictionary, I would preprocess it so
    the trim and blank line tests wouldn't be needed, and then I would store
    it in a tcl array as a hash table where an [info exist
    dictionary($someword)] could be used to check spelling.

    As to threads, I would *highly* recommend you get a copy of Ashok's book,
    The tcl programming language. It has a very good section on threads (and
    many other topics).

    See here: https://www.magicsplat.com/

    I purchased both the paper book and the pdf version (great for searching). >>
    **************************


    My wordlist file has 1,248,300 lines. Each line is a word, yes.

    I am currently using lsearch -nocase for the lookups. Do you know for
    a fact that searching an array is faster than searching a list?
    I'm not confident enough in my own methods to measure these things.

    You see, case will matter if I use [info exists dictionary($someword)]. Handling case in that scenario will also add overhead.

    (I just realized I will have to split my word list in two, common words
    and proper names, because proper names must not be all lowercase.)

    But the lookup is no problem. That is fast enough. I spell check every
    word in a sentence (maximum 80 characters) in one fell swoop and it
    never feels slow at all.

    No corrections or any kind of guessing though, just checking whether
    the words exist or not. The correction suggestion part is currently
    in the R&D stage. Well, just R, no D yet.

    The bottleneck is definitely in loading the word list.

    I did a test without trim and found that the words are not found and a
    false misspell is flagged for everything. I soon realized that the
    newlines become part of each word so the upshot is they invalidate the
    entire dataset. I really have to axe them.

    But I will try your file reading code and see if it's faster.

    Either way, I would like the opportunity to learn about threads.

    I really can't afford any book right now. I don't want to go into
    details, suffice to say that I am in very very bad financial condition
    right now. Like, really, no joke.



    Ok, I guessed right, the list of words are a dictionary of sorts.

    You can always just use the [time] command to measure anything you're trying out. Tcl arrays are (afaik) implemented as hash tables so they should be faster for lookup than lsearch.

    Also, you can create codes, in addition to inclusion in a list. For example:

    set dictionary(Tcl) 1
    set dictionary(tcl) 0

    to indicate one has a capital letter at the beginning.

    As to threads. Here's my 2 cents:

    I would create the thread like so:

    package require Thread

    set tid [thread::create {
    # init some variables here (they are global)
    proc load {args} {
    #...
    return $args/load
    }
    proc lookup {args} {
    #...
    return $args/lookup
    }
    proc maintid {tid} { ;# save main's tid so can thread::send back to main later
    set ::maintid $tid
    return $tid/maintid
    }
    # more procs
    thread::wait
    }]

    # and then use it like this:


    set filename "listofwords.txt"
    thread::send -async $tid [list load $filename] result1 ;# start the dictionary loading
    # .... can do other inits here
    vwait result1

    set word "theword"
    thread::send $tid [list lookup $word] result2

    thread::send $tid [list maintid [thread::id]] result3 ;# send in main thread id


    Here's the output from the above

    --- filename = |listofwords.txt|
    --- result1 = |listofwords.txt/load|
    --- result2 = |theword/lookup|
    --- result3 = |tid000030DC/maintid|
    --- tid = |tid000017DC|
    --- word = |theword|

    This is my preference for structuring threads; it makes them look just like procs.

    In the above, you can call the load one time, -async, and when you are done loading everything else, you would wait on that result variable.

    But... There is a *gotcha*. If you vwait *AFTER* the value has already been set by the thread's return send, you will wait forever.

    So.... I always do a 2-step:

    unset -nocomplain result
    thread::send -async .... result
    #... can do other things here even update or vwait on other variables
    if {![info exist result]} {vwait result} ;# when you are ready to wait for the result

    With this code, you avoid that possible race condition where you allow the thread result to get set in between your thread::send and t
  • From Rich@21:1/5 to Luc on Tue Dec 12 16:38:54 2023
    Luc <[email protected]d> wrote:
    The last mile I need, I think, is retrieving data from the thread.
    More specifically, ::Wordlist after it is built.

    There are several ways to retreive data from a thread.

    1) use the 'tsv' module to setup a shared variable, have the sub-thread
    store results in the shared variable, and have the main thread retreive
    the results from the shared variable (but will sometimes need a way to synchronize the two).

    2) don't use the -async option to thread::send, and just have the
    result be 'returned'. But this makes the main thread wait, which
    removes most of the value of running the sub-thread to gain
    parallelism.

    3) use the optional ?varname? with thread::send and -async to have the
    result returned in varname, and vwait later to pickup the result.

    4) send the data to process, and the id of the thread doing the send to
    the sub-thread, and write the sub-thread to perform a thread::send to
    the supplied id to set a variable with the result. I.e. something
    like:

    sending thread:
    thread::send -async tid2 [list do-process $data [thread::id]]

    receiving thread:
    proc do-process {data tid} {
    set result [do-something-with $data]
    thread::send -async $tid [list set ::thread_result $result]
    }

    And there are probably more ways to "get back" the result than what
    I've detailed above.

    Do note that using threads will also bring you into the world of
    parallell processing with all the complexity that implies, including
    the need for synchronization at times. The
    thread::(mutex|rwmutex|cond) commands provide the sync. primitives, but
    you do have to use them as appropriate.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to Gerald Lester on Tue Dec 12 13:19:17 2023
    On Tue, 12 Dec 2023 07:51:30 -0600, Gerald Lester wrote:

    On 12/11/23 19:56, Luc wrote:
    On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
    ...

    My wordlist file has 1,248,300 lines. Each line is a word, yes.

    I am currently using lsearch -nocase for the lookups. Do you know for
    a fact that searching an array is faster than searching a list?

    For a large list yes!

    That all being said, you may want to step back and consider alternatives.

    Suggestion: use SQLite...

    1) Build "offline" (i.e. before you run your application) a SQLite DB
    with a table that has one column each row one of your words.
    2) Have your application open the SQLite Db and do searches on the table.

    **************************

    I thought about that. I just decided that plain txt files are easier
    (or should I say more convenient) to manage. I know I will be adding
    items to them as time goes by.

    And like I said, I have zero problem with the lookup time. It's working
    very fast, no delay whatsoever and I'm even running two queries on most
    of the words.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Tue Dec 12 16:26:20 2023
    Luc <[email protected]d> wrote:
    On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:

    But... I have a 400k word dictionary I got online somewhere. It's one word >>per line. No spaces, nothing to trim and no blank lines to remove. When I >>timed your code against a simpler,

    set _fp [open $::WordlistFile r]
    set data [read -nonewline $_fp]
    set Wordlist [split $data \n]
    close $_fp
    **************************


    Well, I can say this, your code is many times as fast as mine.
    Very noticeable difference.

    I will be using that approach from now on.

    Your code was iterating a loop, reading one line at a time, and doing
    trims and checks for blank lines, for each line in your file (which you
    said was over 1M lines long in another post.

    So you were looping, in Tcl, over one million times, running the code
    within your loop (a string trim and a check for blank lines).

    et99's code is doing a single 'read' to bring the entire file in all at
    once (much more IO efficient than reading one line at a time -- but
    trades off against higher memory use).

    Then he does a single 'split' on newliness to create the list.

    So he is running two Tcl statements, with most of the 'work' being done
    inside the C code which impliments those two commands, vs. running over
    one million iterations of Tcl code to process one word at a time.

    That is why et99's version is faster. But, et99's version presumes
    that the words in the file are already "trimmed" and no blank lines
    between any words. Both of which are reasonable to do as a "one-time" preprocess of the words, so as to not have to repeat it each time the
    file is loaded.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Tue Dec 12 16:41:50 2023
    Luc <[email protected]d> wrote:
    On Tue, 12 Dec 2023 07:51:30 -0600, Gerald Lester wrote:

    On 12/11/23 19:56, Luc wrote:
    On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
    ...

    My wordlist file has 1,248,300 lines. Each line is a word, yes.

    I am currently using lsearch -nocase for the lookups. Do you know for
    a fact that searching an array is faster than searching a list?

    For a large list yes!

    That all being said, you may want to step back and consider alternatives.

    Suggestion: use SQLite...

    1) Build "offline" (i.e. before you run your application) a SQLite DB
    with a table that has one column each row one of your words.
    2) Have your application open the SQLite Db and do searches on the table.

    **************************

    I thought about that. I just decided that plain txt files are easier
    (or should I say more convenient) to manage. I know I will be adding
    items to them as time goes by.

    And like I said, I have zero problem with the lookup time. It's working
    very fast, no delay whatsoever and I'm even running two queries on most
    of the words.

    You can keep your 'plain text' file, just setup a process to
    'regenerate' the sqlite database whenever you update the plain text
    file.

    The advantage you get with sqlite is that all the preprocessing is done
    ahead of time, and you only incur the "lookup time" when you do a
    lookup.

    A second advantage is your word list could be much larger than what you
    can hold in memory when it is in a sqlite DB (although this advantage
    has shrunk given the huge amount of RAM in modern systems).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Tue Dec 12 16:19:30 2023
    Luc <[email protected]d> wrote:
    I am currently using lsearch -nocase for the lookups. Do you know for
    a fact that searching an array is faster than searching a list?
    I'm not confident enough in my own methods to measure these things.

    You don't need to be confident in your own methods. Tcl provides a
    'time' command that does all the work for you, and reports the time
    taken. So just mock up both versions, and use [time] to time
    executiong of a bunch of lookups, and see which is faster.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to [email protected] on Tue Dec 12 20:49:41 2023
    et99 <[email protected]> wrote:
    I would think with lsearch, not finding a word might take the
    longest, if it's doing a sequential search.

    For the default, yes, because the default is a sequential search.

    But, if your list elements are sorted, you can use the "-sorted"
    option, which speeds it up. The man page simply says "will use a more efficient searching algorithm to search list", I suspect "-sorted"
    turns on a binary search of the list elements.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Rich on Tue Dec 12 13:14:19 2023
    On 12/12/2023 12:49 PM, Rich wrote:
    et99 <[email protected]> wrote:
    I would think with lsearch, not finding a word might take the
    longest, if it's doing a sequential search.

    For the default, yes, because the default is a sequential search.

    But, if your list elements are sorted, you can use the "-sorted"
    option, which speeds it up. The man page simply says "will use a more efficient searching algorithm to search list", I suspect "-sorted"
    turns on a binary search of the list elements.


    Ahhh, good point. A binary search might only be doing around 20 compares on a 1mil word list.

    % lsearch $Wordlist2 hello
    154092
    % lsearch -sorted $Wordlist2 hello
    154092
    % lsearch $Wordlist2 hellox
    -1

    % time {lsearch $Wordlist2 hello}
    1127 microseconds per iteration
    % time {lsearch -sorted $Wordlist2 hello}
    12 microseconds per iteration
    % time {lsearch -sorted $Wordlist2 hellox}
    12 microseconds per iteration

    and when run 50 times,

    % time {lsearch -sorted $Wordlist2 hello} 50
    0.78 microseconds per iteration
    % time {lsearch -sorted $Wordlist2 hellox} 50
    0.66 microseconds per iteration


    Hah! so, no need for an array and a hash lookup.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Luc on Tue Dec 12 12:39:28 2023
    On 12/12/2023 8:19 AM, Luc wrote:
    On Tue, 12 Dec 2023 07:51:30 -0600, Gerald Lester wrote:

    On 12/11/23 19:56, Luc wrote:
    On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
    ...

    My wordlist file has 1,248,300 lines. Each line is a word, yes.

    I am currently using lsearch -nocase for the lookups. Do you know for
    a fact that searching an array is faster than searching a list?

    For a large list yes!

    That all being said, you may want to step back and consider alternatives.

    Suggestion: use SQLite...

    1) Build "offline" (i.e. before you run your application) a SQLite DB
    with a table that has one column each row one of your words.
    2) Have your application open the SQLite Db and do searches on the table.

    **************************

    I thought about that. I just decided that plain txt files are easier
    (or should I say more convenient) to manage. I know I will be adding
    items to them as time goes by.

    And like I said, I have zero problem with the lookup time. It's working
    very fast, no delay whatsoever and I'm even running two queries on most
    of the words.


    One reason you may want a very fast lookup is you may want to eventually also make suggestions.

    One way is to take a mis-spelled word and make changes, like reverse each pair of letters, add a letter and subtract a letter at all possible locations in the word, and then lookup the newly formed words. That's where having a very fast lookup can help.

    Anyway, that's how I once did a spell checker. It worked decently well.

    Incidentally, using a tsv shared variable array can be loaded like so:


    proc load {WordlistFile} {
    set _fp [open $WordlistFile r]
    set data [read -nonewline $_fp]
    set Wordlist2 [split $data \n]
    foreach word $Wordlist2 {
    tsv::set dictionary $word 1
    }
    close $_fp

    }

    This takes only about 300ms to load my 400k dictionary. (A regular array was 200ms).

    Then lookups are on the order of a few microseconds:


    proc lookup {word} {
    return [tsv::exists dictionary $word]
    }

    % lookup hello
    1
    % lookup hellox
    0
    % time {lookup hello}
    8 microseconds per iteration
    % time {lookup hellox}
    7 microseconds per iteration

    I would think with lsearch, not finding a word might take the longest, if it's doing a sequential search.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to All on Tue Dec 12 19:08:01 2023
    On Tue, 12 Dec 2023 12:39:28 -0800, et99 wrote:

    One reason you may want a very fast lookup is you may want to eventually
    also make suggestions.


    I still have to investigate what algorithms there are out there. I tried
    a couple of my own and they failed quite beautifully.

    Anyway, I spent some time testing my newly acquired word list by looking
    up all kinds of far fetched words to see how thorough it was. Out of convenience, I did it like this:

    proc p.findword {word} {
    puts -nonewline [lsearch -nocase $::BIGLIST $word]
    }
    foreach w {word this that something some other whatever dunno} {
    puts "[p.findword $w] $w"
    }

    I went as far as 10 words at a time, and the output is instantaneous
    for all the 10 lookups. With my old algorithm to read the list file
    (the one million iterations), there was a very noticeable pause before
    the output, but now that is instantaneous too.

    So no worries about speed yet.


    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to Rich on Tue Dec 12 18:59:13 2023
    On Tue, 12 Dec 2023 16:41:50 -0000 (UTC), Rich wrote:

    You can keep your 'plain text' file, just setup a process to
    'regenerate' the sqlite database whenever you update the plain text
    file.

    The advantage you get with sqlite is that all the preprocessing is done
    ahead of time, and you only incur the "lookup time" when you do a
    lookup.

    A second advantage is your word list could be much larger than what you
    can hold in memory when it is in a sqlite DB (although this advantage
    has shrunk given the huge amount of RAM in modern systems).

    **************************

    Great tips, I always learn a lot here. Thank you.

    However, I don't understand what you mean by "all the preprocessing is
    done ahead of time."

    What preprocessing? The file is "slurped" once at launch then the word
    list is permanently available in a list. Why would acess to that list
    (in memory, I assume) be slower to access to a database (on disk, for
    sure)?



    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Wed Dec 13 00:05:41 2023
    Luc <[email protected]d> wrote:
    On Tue, 12 Dec 2023 16:41:50 -0000 (UTC), Rich wrote:

    You can keep your 'plain text' file, just setup a process to
    'regenerate' the sqlite database whenever you update the plain text
    file.

    The advantage you get with sqlite is that all the preprocessing is done >>ahead of time, and you only incur the "lookup time" when you do a
    lookup.

    A second advantage is your word list could be much larger than what you
    can hold in memory when it is in a sqlite DB (although this advantage
    has shrunk given the huge amount of RAM in modern systems).

    **************************

    Great tips, I always learn a lot here. Thank you.

    However, I don't understand what you mean by "all the preprocessing is
    done ahead of time."

    What preprocessing?

    Trimming spaces from the words only needs to be done once. Splitting
    lines into words only needs to be done once.

    The file is "slurped" once at launch then the word list is
    permanently available in a list.

    Part of your slurp involves a split on \n to convert the lines into
    words. While that step, on modern CPU's, is fast enough you get the
    luxury to ignore it, it is "preprocessing" that you are performing
    every time you slurp the file in from disk.

    Why would acess to that list (in memory, I assume) be slower to
    access to a database (on disk, for sure)?

    For things that fit in ram, and a list, and provided you have the list
    sorted, and use the -sorted option to list, then lookups in the list
    likely will beat sqlite. But, if the wordlist grows too large for
    memory (this is unlikely for your specific use case, but for other
    kinds of "data" is very common) or you don't keep it sorted so you have
    to use lsearch's linear search then sqlite (provided you tell sqlite to
    index the lookup column) will beat the list method in most cases.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to Rich on Tue Dec 12 21:36:44 2023
    On Wed, 13 Dec 2023 00:05:41 -0000 (UTC), Rich wrote:

    For things that fit in ram, and a list, and provided you have the list >sorted, and use the -sorted option to list, then lookups in the list
    likely will beat sqlite. But, if the wordlist grows too large for
    memory (this is unlikely for your specific use case, but for other
    kinds of "data" is very common) or you don't keep it sorted so you have
    to use lsearch's linear search then sqlite (provided you tell sqlite to
    index the lookup column) will beat the list method in most cases.

    **************************

    Do you by any chance know what happens if I use lsearch -sorted on a
    list that

    A. is not perfectly or completely sorted (new items have been added to
    the end)

    B. I run the garden variety GNU 'sort' command on the word list file
    so it may not comply exactly with whatever lsearch thinks should be
    considered "sorted" (ascii, alnum, etc.)?

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to Luc on Tue Dec 12 21:51:14 2023
    On Tue, 12 Dec 2023 21:36:44 -0300, Luc wrote:

    Do you by any chance know what happens if I use lsearch -sorted on a
    list that

    A. is not perfectly or completely sorted (new items have been added to
    the end)

    B. I run the garden variety GNU 'sort' command on the word list file
    so it may not comply exactly with whatever lsearch thinks should be >considered "sorted" (ascii, alnum, etc.)?

    **************************

    Whoa. I don't know what is going on, but something is going on and
    it's bad.

    proc p.findword {word} {
    puts -nonewline [lsearch -nocase $::BIGLIST $word]
    }
    foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
    puts "[p.findword $w] $w"
    }


    9 out of 10 words found.

    Now, using -sorted:

    proc p.findword {word} {
    puts -nonewline [lsearch -nocase -sorted $::BIGLIST $word]
    }
    foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
    puts "[p.findword $w] $w"
    }


    Visibly faster, but only 3 out of 10 words found.

    Not good.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Luc on Tue Dec 12 18:28:23 2023
    On 12/12/2023 4:51 PM, Luc wrote:
    On Tue, 12 Dec 2023 21:36:44 -0300, Luc wrote:

    Do you by any chance know what happens if I use lsearch -sorted on a
    list that

    A. is not perfectly or completely sorted (new items have been added to
    the end)

    B. I run the garden variety GNU 'sort' command on the word list file
    so it may not comply exactly with whatever lsearch thinks should be
    considered "sorted" (ascii, alnum, etc.)?

    **************************

    Whoa. I don't know what is going on, but something is going on and
    it's bad.

    proc p.findword {word} {
    puts -nonewline [lsearch -nocase $::BIGLIST $word]
    }
    foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
    puts "[p.findword $w] $w"
    }


    9 out of 10 words found.

    Now, using -sorted:

    proc p.findword {word} {
    puts -nonewline [lsearch -nocase -sorted $::BIGLIST $word]
    }
    foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
    puts "[p.findword $w] $w"
    }


    Visibly faster, but only 3 out of 10 words found.

    Not good.


    What I meant by pre-processing was to take your list as cleaned up, sorted, etc. and write it out, once. Thereafter, you could use the read/split to restore it to memory quickly.

    If, however, you are going to be adding words during a run, you could just keep 2 lists. The second list would likely be very short if added by the user during a session. Merging new words in might be a pain, and re-sorting the entire list likewise.

    On the other hand, this is a plus for using the array, since order isn't important there, as it's just hashing them.

    But are you also going to let the user do a "save dictionary" after adding in new words? Programs never do stay simple :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Wed Dec 13 03:00:28 2023
    Luc <[email protected]d> wrote:
    Whoa. I don't know what is going on, but something is going on and
    it's bad.

    proc p.findword {word} {
    puts -nonewline [lsearch -nocase $::BIGLIST $word]
    }
    foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
    puts "[p.findword $w] $w"
    }


    9 out of 10 words found.

    Now, using -sorted:

    proc p.findword {word} {
    puts -nonewline [lsearch -nocase -sorted $::BIGLIST $word]
    }
    foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
    puts "[p.findword $w] $w"
    }

    When you use "-sorted", ::BIGLIST is, in fact, sorted, right?

    Visibly faster, but only 3 out of 10 words found.

    Not good.

    Given the reduction in hits, this implies you do not have ::BIGLIST
    sorted.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Wed Dec 13 02:58:43 2023
    Luc <[email protected]d> wrote:
    On Wed, 13 Dec 2023 00:05:41 -0000 (UTC), Rich wrote:

    For things that fit in ram, and a list, and provided you have the list >>sorted, and use the -sorted option to list, then lookups in the list
    likely will beat sqlite. But, if the wordlist grows too large for
    memory (this is unlikely for your specific use case, but for other
    kinds of "data" is very common) or you don't keep it sorted so you have
    to use lsearch's linear search then sqlite (provided you tell sqlite to >>index the lookup column) will beat the list method in most cases.

    **************************

    Do you by any chance know what happens if I use lsearch -sorted on a
    list that

    A. is not perfectly or completely sorted (new items have been added to
    the end)

    Most likely the search will complete, but it may or may not find the
    requested element.

    Try creating a random ordered list from your long word list, then do a
    lsearch -sorted on it and see what happens. ::struct::list shuffle
    from Tcllib can randomly order your word list for you.

    B. I run the garden variety GNU 'sort' command on the word list file
    so it may not comply exactly with whatever lsearch thinks should be considered "sorted" (ascii, alnum, etc.)?

    The manpage implies that -sorted expects the list to be sorted via the
    -ascii comparison operator of lsort. Since this is unicode code point
    order, it likely exactly matches that which gnu sort produces in its
    default state.

    But there are only two possibilities:

    1) gnu sort sorts in the order that lsearch assumes for -sorted - in
    which case you get back correct answers

    2) gnu sort sorts slightly differently -- in which case you may get
    back any of:

    no answers at all (i.e. always a 'not found')

    the wrong answers (although it is much more likely to simply return
    "not found")

    sometimes the right answer, sometimes a false "not found"

    This third one would occur when the list is almost in the right order,
    but only a few elements differ. If what you search for ends up being
    in a segmennt that is ordered as expected, you get the right answer (or
    a true "not found" answer). If what you search for ends up being in a
    segment that is not ordered as expected, you most likely get back a
    false not found.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to Rich on Wed Dec 13 08:54:35 2023
    On Wed, 13 Dec 2023 03:00:28 -0000 (UTC), Rich wrote:

    When you use "-sorted", ::BIGLIST is, in fact, sorted, right?

    Visibly faster, but only 3 out of 10 words found.

    Not good.

    Given the reduction in hits, this implies you do not have ::BIGLIST
    sorted.
    **************************

    ::BIGLIST is slurped straight from the file which was a merge of multiple
    word lists and dictionaries I found here and there, then sorted with
    sort -u to remove the duplicates.

    So it is sorted, but I guess it's not sorted in the way that lsearch
    expects.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to All on Wed Dec 13 09:10:38 2023
    On Tue, 12 Dec 2023 18:28:23 -0800, et99 wrote:

    What I meant by pre-processing was to take your list as cleaned up,
    sorted, etc. and write it out, once. Thereafter, you could use the
    read/split to restore it to memory quickly.

    If, however, you are going to be adding words during a run, you could just >keep 2 lists. The second list would likely be very short if added by the
    user during a session. Merging new words in might be a pain, and
    re-sorting the entire list likewise.

    On the other hand, this is a plus for using the array, since order isn't >important there, as it's just hashing them.

    But are you also going to let the user do a "save dictionary" after adding
    in new words? Programs never do stay simple :)

    **************************

    Well, yes. It's done and in production already.

    You see, names are simple. They have to begin with a capital letter.

    But "begin" means it can be either Mary or MARY. For that I need some
    kind of -nocase parameter or one normalization step plus a second lookup.
    That may or may not defeat the superior speed of array lookups or more
    likely just make the difference less meaningful.

    Common words are less simple. In the beginning of a sentence, they must
    begin with a capital letter. In the middle of a sentence, they must
    begin with a small letter. But in either case it may be all upper case
    too.

    The shortest route I could think of was two lists: things and names.
    1. Search in the first list with no case and that's it.
    2. Not found? Search in the second list as is and that's it.
    3. Still not found? Capitalize it and look for it again in the list
    of names.

    In case you're wondering, the problem of capitalizing words (or not)
    according to punctuation is taken care of by a completely different proc
    that does auto correct according to another list. I actually use the
    concept of auto correct to auto expand abbreviations and type faster.
    That proc takes care of capitalization according to punctuation.
    In a public application that would not be good enough, but since this
    is for private use and is working as intended, I won't bother fixing
    what ain't broken.

    But another problem comes up.

    In my current design, boxes with any problem cannot be approved and I am
    not allowed to jump to the next one until the problem is properly fixed.
    A "problem" currently means too many characters or an empty box. Empty
    boxes may be desirable in certain circumstances so there is a "force"
    command (and key shortcut) in case I want to override it. Misspellings
    will just be a third kind of problem.

    Workflow speed is always a priority with this thing so I implemented the possibility of a double override action. The first override key press
    will add all unknown words to the word list and the second override will "approve and move forward."

    But then I can't distinguish things from names. I can, but I guess I
    would have to introduce a pop-up to decide which one every time. That
    would slow things down. I though that maybe it would be better to just
    use one global word list and take care of casing with my own human proofreading.

    Then again, unknown words are highly likely to be proper names so I
    decided to detect their case and send them straight to the names list
    if they are written with a capital letter whether it's a name or not.
    If they are not a name and happen to show up again in small letters,
    then I will add them again, in which case they will go to the word list.

    Now words or names are always added twice: to the list in memory and
    appended to the file on disk.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to Luc on Wed Dec 13 09:15:43 2023
    On Wed, 13 Dec 2023 08:54:35 -0300, Luc wrote:

    ::BIGLIST is slurped straight from the file which was a merge of multiple >word lists and dictionaries I found here and there, then sorted with
    sort -u to remove the duplicates.

    So it is sorted, but I guess it's not sorted in the way that lsearch
    expects.

    **************************

    Well, I added an lsort step to the file slurp procedure and now using
    lsort -nocase -sorted yields all the expected search hits.

    And everything is still very fast.

    So I guess that much is taken care of.

    Thank you again.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Wed Dec 13 14:26:29 2023
    Luc <[email protected]d> wrote:
    On Wed, 13 Dec 2023 03:00:28 -0000 (UTC), Rich wrote:

    When you use "-sorted", ::BIGLIST is, in fact, sorted, right?

    Visibly faster, but only 3 out of 10 words found.

    Not good.

    Given the reduction in hits, this implies you do not have ::BIGLIST
    sorted.
    **************************

    ::BIGLIST is slurped straight from the file which was a merge of multiple word lists and dictionaries I found here and there, then sorted with
    sort -u to remove the duplicates.

    So it is sorted, but I guess it's not sorted in the way that lsearch
    expects.

    That would be my guess.

    Slurp it in, sort it with lsort, then save the result to a second file
    (note you'll want to use [join] and join with \n when saving to a
    second file.

    Then run "diff -u" on both files (and being 1m lines, pipe the result
    to less) and see if there is a difference, and what is different.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Wed Dec 13 14:53:37 2023
    Luc <[email protected]d> wrote:
    On Wed, 13 Dec 2023 08:54:35 -0300, Luc wrote:

    ::BIGLIST is slurped straight from the file which was a merge of multiple >>word lists and dictionaries I found here and there, then sorted with
    sort -u to remove the duplicates.

    So it is sorted, but I guess it's not sorted in the way that lsearch >>expects.

    **************************

    Well, I added an lsort step to the file slurp procedure and now using
    lsort -nocase -sorted yields all the expected search hits.

    Do note that if you lsort before saving, then you don't have to lsort
    upon loading until such time as you add more words.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Luc on Wed Dec 13 13:33:45 2023
    On 12/13/2023 4:10 AM, Luc wrote:
    On Tue, 12 Dec 2023 18:28:23 -0800, et99 wrote:

    What I meant by pre-processing was to take your list as cleaned up,
    sorted, etc. and write it out, once. Thereafter, you could use the
    read/split to restore it to memory quickly.

    If, however, you are going to be adding words during a run, you could just >> keep 2 lists. The second list would likely be very short if added by the
    user during a session. Merging new words in might be a pain, and
    re-sorting the entire list likewise.

    On the other hand, this is a plus for using the array, since order isn't
    important there, as it's just hashing them.

    But are you also going to let the user do a "save dictionary" after adding >> in new words? Programs never do stay simple :)

    **************************

    Well, yes. It's done and in production already.

    You see, names are simple. They have to begin with a capital letter.

    But "begin" means it can be either Mary or MARY. For that I need some
    kind of -nocase parameter or one normalization step plus a second lookup. That may or may not defeat the superior speed of array lookups or more
    likely just make the difference less meaningful.

    Common words are less simple. In the beginning of a sentence, they must
    begin with a capital letter. In the middle of a sentence, they must
    begin with a small letter. But in either case it may be all upper case
    too.

    The shortest route I could think of was two lists: things and names.
    1. Search in the first list with no case and that's it.
    2. Not found? Search in the second list as is and that's it.
    3. Still not found? Capitalize it and look for it again in the list
    of names.

    In case you're wondering, the problem of capitalizing words (or not) according to punctuation is taken care of by a completely different proc
    that does auto correct according to another list. I actually use the
    concept of auto correct to auto expand abbreviations and type faster.
    That proc takes care of capitalization according to punctuation.
    In a public application that would not be good enough, but since this
    is for private use and is working as intended, I won't bother fixing
    what ain't broken.

    But another problem comes up.

    In my current design, boxes with any problem cannot be approved and I am
    not allowed to jump to the next one until the problem is properly fixed.
    A "problem" currently means too many characters or an empty box. Empty
    boxes may be desirable in certain circumstances so there is a "force"
    command (and key shortcut) in case I want to override it. Misspellings
    will just be a third kind of problem.

    Workflow speed is always a priority with this thing so I implemented the possibility of a double override action. The first override key press
    will add all unknown words to the word list and the second override will "approve and move forward."

    But then I can't distinguish things from names. I can, but I guess I
    would have to introduce a pop-up to decide which one every time. That
    would slow things down. I though that maybe it would be better to just
    use one global word list and take care of casing with my own human proofreading.

    Then again, unknown words are highly likely to be proper names so I
    decided to detect their case and send them straight to the names list
    if they are written with a capital letter whether it's a name or not.
    If they are not a name and happen to show up again in small letters,
    then I will add them again, in which case they will go to the word list.

    Now words or names are always added twice: to the list in memory and
    appended to the file on disk.



    On your 3 step approach:

    What about words that can be both, like Cat Stevens and I have a cat; Drew and drew a picture. Will you accept the user's case choice in that situation?

    I see this as 2 different problems. Checking spelling and checking capitalization. You mention you already have separate procs for this.

    As to sorting your list(s). What is the purpose other than to make lsearch lookup faster? As to doing a -nocase search, how is that any different than maintaining your list(s) in a single case and converting to that case before lookup?

    So, two choices, use an array or use a list.

    Here's my check list:

    List:

    Must be kept sorted for speed.
    Adding words needs to re-sort or merge.
    Duplicates could be a problem (but I'm not sure)
    Can have more than 1 list

    Array:

    No need to sort for speed
    Need to keep in one case (say lower) and convert to (lower) before lookup Adding words is no problem, just make it lowercase first
    No problems with duplicates, adding the same word is a no-op
    Can have multiple arrays


    As to the need to have an in memory plus a disk file would mean:

    List:

    Have to resort when adding new words

    Array:

    Just add any new word to the end of the file. No need to sort.


    As to loading up with a separate thread. The tsv variables are arrays. A 2nd thread can load up a tsv array from a file and the main thread can see that shared array without needing to do lookups in the second thread (as I mistakenly wrote in a prior
    post).

    My timings for tsv arrays with [tsv::exists dictionary $word] show it's just as fast as a non-shared array.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to All on Wed Dec 13 19:33:27 2023
    On Wed, 13 Dec 2023 13:33:45 -0800, et99 wrote:

    On your 3 step approach:

    What about words that can be both, like Cat Stevens and I have a cat; Drew >and drew a picture. Will you accept the user's case choice in that
    situation?

    First, the user (aka "yours truly") gets a warning that a certain word is unknown. The user decides whether that word really is a typo or a new word
    that should be added to the spelling dictionary. It's a human in control.
    Which is good.

    My word list is very thourough. I had a little fun looking for weird words
    and only two or three were missing and all of them were slang. I had to
    cheat a little to find something missing. So what kind of words will be
    flagged as unknown although being correct? Names. I reckon about 99% of
    the times.

    Remember, I want to do things fast for optimum productivity. Being able
    to press one key and add the word to its proper list automatically is considered valuable. If the word begins with capital case, it's even more likely to be a name. The odds of an unknown word that is not a name being
    the first one in a sentence are very slim. But if it doesn't have a
    capital letter then it has to be a word.

    So maybe it is not a name. So maybe it was the first word in a sentence.
    Well, it's a very rare exception and I don't mind adding it to the list
    of names. The list won't be "tarnished." I don't care. The worst that
    can happen is it shows up again (hopefully in small letter this time) and
    I have to add it again. No big deal. It's a very rare occurrence.

    The separation of things and names is good because it won't flag "Mary,"
    but it will flag "mary." I do want that feature. The only problem is it
    will not flag "john" (in case it refers to the restroom), but oh well,
    this is a typical corner case and no spelling checking system is perfect.
    I know I have worked with a lot of them. They always need human
    supervision.


    So, two choices, use an array or use a list.

    I will investigate the choice of array, but later. I have other
    priorities right now. Like I said, I don't have a problem with lookup
    speed. I had a problem with loading the list off a file at program
    launch, but your code really solved that one. :-)

    Threads are going to the back burner too. I don't need them now. I
    copied and saved all the advice that wasshared here. I will have another
    look at it eventually. I want to learn it.


    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)