Forum: >>> Magnum BBS <<<

Retrieving data from a thread

From Luc@21:1/5 to All on Mon Dec 11 18:34:16 2023

Hi.

My application used to take 7 seconds to load. A little long.

Now I am trying to implement some kind of spell checker so I just added
code to load and process a file.

set ::WordlistFile /home/tcl/longlist.txt
set ::Wordlist ""
set _fp [open $::WordlistFile r]
while {![eof $_fp]} {
set _line [string trim [gets $_fp]]
if {$_line == ""} {continue}
lappend ::Wordlist $_line
}
close $_fp

Now it takes 10 seconds to load. Grumble.

I thought about using threads to load the word list and make it all
load faster.

In fact, it already loads another list. Maybe I could "threadify" both?

Anyway,

I've been reading about threads and I must say the existing documentation
is not very easy to understand. Code examples on google are pretty scarce
too. I found an interesting one on StackOverflow where Donal (DKF)
suggests using pools, but I couldn't make that work. I mean, I have to
"get" data with tpool::get? But when? How do I know when the job is done?
He also suggests tsv, but I found the relevant documentation hard to read
and understand.

Yet I've made prototypes that almost worked. Almost.

The last mile I need, I think, is retrieving data from the thread. More specifically, ::Wordlist after it is built.

I wish I didn't have to call it explicitly. Just let the thread build
and set ::Wordlist whenever it feels ready. My goal is to let the thread
load the list while the rest of the application loads other things.
Of course, it must not take longer than a few seconds. It has to be done
when I begin to type.

Among so many attempts, I only came up with one that worked. But
the application wouldn't load any faster. I must have done something
wrong. Can you please enlighten me?

package require Thread
set thr [thread::create]
thread::send $thr "set wordlistfile $::WordlistFile"
thread::send -async $thr {
set _fp [open $wordlistfile r]
while {![eof $_fp]} {
set _line [string trim [gets $_fp]]
if {$_line == ""} {continue}
lappend ::Wordlist $_line
}
close $_fp
}
puts "Now what?"

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From et99@21:1/5 to Luc on Mon Dec 11 16:24:36 2023

On 12/11/2023 1:34 PM, Luc wrote:

Hi.

My application used to take 7 seconds to load. A little long.

Now I am trying to implement some kind of spell checker so I just added
code to load and process a file.

set ::WordlistFile /home/tcl/longlist.txt
set ::Wordlist ""
set _fp [open $::WordlistFile r]
while {![eof $_fp]} {
set _line [string trim [gets $_fp]]
if {$_line == ""} {continue}
lappend ::Wordlist $_line
}
close $_fp

Now it takes 10 seconds to load. Grumble.

I thought about using threads to load the word list and make it all
load faster.

In fact, it already loads another list. Maybe I could "threadify" both?

Anyway,

I've been reading about threads and I must say the existing documentation
is not very easy to understand. Code examples on google are pretty scarce too. I found an interesting one on StackOverflow where Donal (DKF)
suggests using pools, but I couldn't make that work. I mean, I have to
"get" data with tpool::get? But when? How do I know when the job is done?
He also suggests tsv, but I found the relevant documentation hard to read
and understand.

Yet I've made prototypes that almost worked. Almost.

The last mile I need, I think, is retrieving data from the thread. More specifically, ::Wordlist after it is built.

I wish I didn't have to call it explicitly. Just let the thread build
and set ::Wordlist whenever it feels ready. My goal is to let the thread
load the list while the rest of the application loads other things.
Of course, it must not take longer than a few seconds. It has to be done
when I begin to type.

Among so many attempts, I only came up with one that worked. But
the application wouldn't load any faster. I must have done something
wrong. Can you please enlighten me?

package require Thread
set thr [thread::create]
thread::send $thr "set wordlistfile $::WordlistFile"
thread::send -async $thr {
set _fp [open $wordlistfile r]
while {![eof $_fp]} {
set _line [string trim [gets $_fp]]
if {$_line == ""} {continue}
lappend ::Wordlist $_line
}
close $_fp
}
puts "Now what?"

What is /home/tcl/longlist.txt

Are these the words to lookup or your spelling dictionary. I hate to assume. Based on your variable names, I can't tell which it is.

But... I have a 400k word dictionary I got online somewhere. It's one word per line. No spaces, nothing to trim and no blank lines to remove. When I timed your code against a simpler,

set _fp [open $::WordlistFile r]
set data [read -nonewline $_fp]
set Wordlist [split $data \n]
close $_fp

It went from 700ms to 50ms to create Wordlist. But again, not knowing for sure what that file really is....

For example, if it is your spelling dictionary, I would preprocess it so the trim and blank line tests wouldn't be needed, and then I would store it in a tcl array as a hash table where an [info exist dictionary($someword)] could be used to check
spelling.

As to threads, I would *highly* recommend you get a copy of Ashok's book, The tcl programming language. It has a very good section on threads (and many other topics).

See here: https://www.magicsplat.com/

I purchased both the paper book and the pdf version (great for searching).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to All on Mon Dec 11 22:56:10 2023

On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:

What is /home/tcl/longlist.txt

Are these the words to lookup or your spelling dictionary. I hate to
assume. Based on your variable names, I can't tell which it is.

But... I have a 400k word dictionary I got online somewhere. It's one word >per line. No spaces, nothing to trim and no blank lines to remove. When I >timed your code against a simpler,

set _fp [open $::WordlistFile r]
set data [read -nonewline $_fp]
set Wordlist [split $data \n]
close $_fp

It went from 700ms to 50ms to create Wordlist. But again, not knowing for >sure what that file really is....

For example, if it is your spelling dictionary, I would preprocess it so
the trim and blank line tests wouldn't be needed, and then I would store
it in a tcl array as a hash table where an [info exist
dictionary($someword)] could be used to check spelling.

As to threads, I would *highly* recommend you get a copy of Ashok's book,
The tcl programming language. It has a very good section on threads (and
many other topics).

See here: https://www.magicsplat.com/

I purchased both the paper book and the pdf version (great for searching).

**************************

My wordlist file has 1,248,300 lines. Each line is a word, yes.

I am currently using lsearch -nocase for the lookups. Do you know for
a fact that searching an array is faster than searching a list?
I'm not confident enough in my own methods to measure these things.

You see, case will matter if I use [info exists dictionary($someword)]. Handling case in that scenario will also add overhead.

(I just realized I will have to split my word list in two, common words
and proper names, because proper names must not be all lowercase.)

But the lookup is no problem. That is fast enough. I spell check every
word in a sentence (maximum 80 characters) in one fell swoop and it
never feels slow at all.

No corrections or any kind of guessing though, just checking whether
the words exist or not. The correction suggestion part is currently
in the R&D stage. Well, just R, no D yet.

The bottleneck is definitely in loading the word list.

I did a test without trim and found that the words are not found and a
false misspell is flagged for everything. I soon realized that the
newlines become part of each word so the upshot is they invalidate the
entire dataset. I really have to axe them.

But I will try your file reading code and see if it's faster.

Either way, I would like the opportunity to learn about threads.

I really can't afford any book right now. I don't want to go into
details, suffice to say that I am in very very bad financial condition
right now. Like, really, no joke.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to All on Mon Dec 11 23:35:45 2023

On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:

But... I have a 400k word dictionary I got online somewhere. It's one word >per line. No spaces, nothing to trim and no blank lines to remove. When I >timed your code against a simpler,

set _fp [open $::WordlistFile r]
set data [read -nonewline $_fp]
set Wordlist [split $data \n]
close $_fp

**************************

Well, I can say this, your code is many times as fast as mine.
Very noticeable difference.

I will be using that approach from now on.

Thank you.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From et99@21:1/5 to Luc on Mon Dec 11 22:10:08 2023

On 12/11/2023 5:56 PM, Luc wrote:

On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:

What is /home/tcl/longlist.txt

Are these the words to lookup or your spelling dictionary. I hate to
assume. Based on your variable names, I can't tell which it is.

But... I have a 400k word dictionary I got online somewhere. It's one word >> per line. No spaces, nothing to trim and no blank lines to remove. When I
timed your code against a simpler,

set _fp [open $::WordlistFile r]
set data [read -nonewline $_fp]
set Wordlist [split $data \n]
close $_fp

It went from 700ms to 50ms to create Wordlist. But again, not knowing for
sure what that file really is....

For example, if it is your spelling dictionary, I would preprocess it so
the trim and blank line tests wouldn't be needed, and then I would store
it in a tcl array as a hash table where an [info exist
dictionary($someword)] could be used to check spelling.

As to threads, I would *highly* recommend you get a copy of Ashok's book,
The tcl programming language. It has a very good section on threads (and
many other topics).

See here: https://www.magicsplat.com/

I purchased both the paper book and the pdf version (great for searching). >>

**************************

My wordlist file has 1,248,300 lines. Each line is a word, yes.

I am currently using lsearch -nocase for the lookups. Do you know for
a fact that searching an array is faster than searching a list?
I'm not confident enough in my own methods to measure these things.

You see, case will matter if I use [info exists dictionary($someword)]. Handling case in that scenario will also add overhead.

(I just realized I will have to split my word list in two, common words
and proper names, because proper names must not be all lowercase.)

But the lookup is no problem. That is fast enough. I spell check every
word in a sentence (maximum 80 characters) in one fell swoop and it
never feels slow at all.

No corrections or any kind of guessing though, just checking whether
the words exist or not. The correction suggestion part is currently
in the R&D stage. Well, just R, no D yet.

The bottleneck is definitely in loading the word list.

I did a test without trim and found that the words are not found and a
false misspell is flagged for everything. I soon realized that the
newlines become part of each word so the upshot is they invalidate the
entire dataset. I really have to axe them.

But I will try your file reading code and see if it's faster.

Either way, I would like the opportunity to learn about threads.

I really can't afford any book right now. I don't want to go into
details, suffice to say that I am in very very bad financial condition
right now. Like, really, no joke.

Ok, I guessed right, the list of words are a dictionary of sorts.

You can always just use the [time] command to measure anything you're trying out. Tcl arrays are (afaik) implemented as hash tables so they should be faster for lookup than lsearch.

Also, you can create codes, in addition to inclusion in a list. For example:

set dictionary(Tcl) 1
set dictionary(tcl) 0

to indicate one has a capital letter at the beginning.

As to threads. Here's my 2 cents:

I would create the thread like so:

package require Thread

set tid [thread::create {
# init some variables here (they are global)
proc load {args} {
#...
return $args/load
}
proc lookup {args} {
#...
return $args/lookup
}
proc maintid {tid} { ;# save main's tid so can thread::send back to main later
set ::maintid $tid
return $tid/maintid
}
# more procs
thread::wait
}]

# and then use it like this:

set filename "listofwords.txt"
thread::send -async $tid [list load $filename] result1 ;# start the dictionary loading
# .... can do other inits here
vwait result1

set word "theword"
thread::send $tid [list lookup $word] result2

thread::send $tid [list maintid [thread::id]] result3 ;# send in main thread id

Here's the output from the above

--- filename = |listofwords.txt|
--- result1 = |listofwords.txt/load|
--- result2 = |theword/lookup|
--- result3 = |tid000030DC/maintid|
--- tid = |tid000017DC|
--- word = |theword|

This is my preference for structuring threads; it makes them look just like procs.

In the above, you can call the load one time, -async, and when you are done loading everything else, you would wait on that result variable.

But... There is a *gotcha*. If you vwait *AFTER* the value has already been set by the thread's return send, you will wait forever.

So.... I always do a 2-step:

unset -nocomplain result
thread::send -async .... result
#... can do other things here even update or vwait on other variables
if {![info exist result]} {vwait result} ;# when you are ready to wait for the result

With this code, you avoid that possible race condition where you allow the thread result to get set in between your thread::send and t

From Rich@21:1/5 to Luc on Tue Dec 12 16:38:54 2023

Luc <[email protected]d> wrote:

The last mile I need, I think, is retrieving data from the thread.
More specifically, ::Wordlist after it is built.

There are several ways to retreive data from a thread.

1) use the 'tsv' module to setup a shared variable, have the sub-thread
store results in the shared variable, and have the main thread retreive
the results from the shared variable (but will sometimes need a way to synchronize the two).

2) don't use the -async option to thread::send, and just have the
result be 'returned'. But this makes the main thread wait, which
removes most of the value of running the sub-thread to gain
parallelism.

3) use the optional ?varname? with thread::send and -async to have the
result returned in varname, and vwait later to pickup the result.

4) send the data to process, and the id of the thread doing the send to
the sub-thread, and write the sub-thread to perform a thread::send to
the supplied id to set a variable with the result. I.e. something
like:

sending thread:
thread::send -async tid2 [list do-process $data [thread::id]]

receiving thread:
proc do-process {data tid} {
set result [do-something-with $data]
thread::send -async $tid [list set ::thread_result $result]
}

And there are probably more ways to "get back" the result than what
I've detailed above.

Do note that using threads will also bring you into the world of
parallell processing with all the complexity that implies, including
the need for synchronization at times. The
thread::(mutex|rwmutex|cond) commands provide the sync. primitives, but
you do have to use them as appropriate.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to Gerald Lester on Tue Dec 12 13:19:17 2023

On Tue, 12 Dec 2023 07:51:30 -0600, Gerald Lester wrote:

On 12/11/23 19:56, Luc wrote:

On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
...

My wordlist file has 1,248,300 lines. Each line is a word, yes.

I am currently using lsearch -nocase for the lookups. Do you know for
a fact that searching an array is faster than searching a list?

For a large list yes!

That all being said, you may want to step back and consider alternatives.

Suggestion: use SQLite...

1) Build "offline" (i.e. before you run your application) a SQLite DB
with a table that has one column each row one of your words.
2) Have your application open the SQLite Db and do searches on the table.

**************************

I thought about that. I just decided that plain txt files are easier
(or should I say more convenient) to manage. I know I will be adding
items to them as time goes by.

And like I said, I have zero problem with the lookup time. It's working
very fast, no delay whatsoever and I'm even running two queries on most
of the words.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Tue Dec 12 16:26:20 2023

Luc <[email protected]d> wrote:

On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:

But... I have a 400k word dictionary I got online somewhere. It's one word >>per line. No spaces, nothing to trim and no blank lines to remove. When I >>timed your code against a simpler,

set _fp [open $::WordlistFile r]
set data [read -nonewline $_fp]
set Wordlist [split $data \n]
close $_fp

**************************

Well, I can say this, your code is many times as fast as mine.
Very noticeable difference.

I will be using that approach from now on.

Your code was iterating a loop, reading one line at a time, and doing
trims and checks for blank lines, for each line in your file (which you
said was over 1M lines long in another post.

So you were looping, in Tcl, over one million times, running the code
within your loop (a string trim and a check for blank lines).

et99's code is doing a single 'read' to bring the entire file in all at
once (much more IO efficient than reading one line at a time -- but
trades off against higher memory use).

Then he does a single 'split' on newliness to create the list.

So he is running two Tcl statements, with most of the 'work' being done
inside the C code which impliments those two commands, vs. running over
one million iterations of Tcl code to process one word at a time.

That is why et99's version is faster. But, et99's version presumes
that the words in the file are already "trimmed" and no blank lines
between any words. Both of which are reasonable to do as a "one-time" preprocess of the words, so as to not have to repeat it each time the
file is loaded.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Tue Dec 12 16:41:50 2023

Luc <[email protected]d> wrote:

On Tue, 12 Dec 2023 07:51:30 -0600, Gerald Lester wrote:

On 12/11/23 19:56, Luc wrote:

On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
...

My wordlist file has 1,248,300 lines. Each line is a word, yes.

I am currently using lsearch -nocase for the lookups. Do you know for
a fact that searching an array is faster than searching a list?

For a large list yes!

That all being said, you may want to step back and consider alternatives.

Suggestion: use SQLite...

1) Build "offline" (i.e. before you run your application) a SQLite DB
with a table that has one column each row one of your words.
2) Have your application open the SQLite Db and do searches on the table.

**************************

I thought about that. I just decided that plain txt files are easier
(or should I say more convenient) to manage. I know I will be adding
items to them as time goes by.

And like I said, I have zero problem with the lookup time. It's working
very fast, no delay whatsoever and I'm even running two queries on most
of the words.

You can keep your 'plain text' file, just setup a process to
'regenerate' the sqlite database whenever you update the plain text
file.

The advantage you get with sqlite is that all the preprocessing is done
ahead of time, and you only incur the "lookup time" when you do a
lookup.

A second advantage is your word list could be much larger than what you
can hold in memory when it is in a sqlite DB (although this advantage
has shrunk given the huge amount of RAM in modern systems).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Tue Dec 12 16:19:30 2023

Luc <[email protected]d> wrote:

I am currently using lsearch -nocase for the lookups. Do you know for
a fact that searching an array is faster than searching a list?
I'm not confident enough in my own methods to measure these things.

You don't need to be confident in your own methods. Tcl provides a
'time' command that does all the work for you, and reports the time
taken. So just mock up both versions, and use [time] to time
executiong of a bunch of lookups, and see which is faster.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to [email protected] on Tue Dec 12 20:49:41 2023

et99 <[email protected]> wrote:

I would think with lsearch, not finding a word might take the
longest, if it's doing a sequential search.

For the default, yes, because the default is a sequential search.

But, if your list elements are sorted, you can use the "-sorted"
option, which speeds it up. The man page simply says "will use a more efficient searching algorithm to search list", I suspect "-sorted"
turns on a binary search of the list elements.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From et99@21:1/5 to Rich on Tue Dec 12 13:14:19 2023

On 12/12/2023 12:49 PM, Rich wrote:

et99 <[email protected]> wrote:

I would think with lsearch, not finding a word might take the
longest, if it's doing a sequential search.

For the default, yes, because the default is a sequential search.

But, if your list elements are sorted, you can use the "-sorted"
option, which speeds it up. The man page simply says "will use a more efficient searching algorithm to search list", I suspect "-sorted"
turns on a binary search of the list elements.

Ahhh, good point. A binary search might only be doing around 20 compares on a 1mil word list.

% lsearch $Wordlist2 hello
154092
% lsearch -sorted $Wordlist2 hello
154092
% lsearch $Wordlist2 hellox
-1

% time {lsearch $Wordlist2 hello}
1127 microseconds per iteration
% time {lsearch -sorted $Wordlist2 hello}
12 microseconds per iteration
% time {lsearch -sorted $Wordlist2 hellox}
12 microseconds per iteration

and when run 50 times,

% time {lsearch -sorted $Wordlist2 hello} 50
0.78 microseconds per iteration
% time {lsearch -sorted $Wordlist2 hellox} 50
0.66 microseconds per iteration

Hah! so, no need for an array and a hash lookup.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From et99@21:1/5 to Luc on Tue Dec 12 12:39:28 2023

On 12/12/2023 8:19 AM, Luc wrote:

On Tue, 12 Dec 2023 07:51:30 -0600, Gerald Lester wrote:

On 12/11/23 19:56, Luc wrote:

On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
...

My wordlist file has 1,248,300 lines. Each line is a word, yes.

I am currently using lsearch -nocase for the lookups. Do you know for
a fact that searching an array is faster than searching a list?

For a large list yes!

That all being said, you may want to step back and consider alternatives.

Suggestion: use SQLite...

1) Build "offline" (i.e. before you run your application) a SQLite DB
with a table that has one column each row one of your words.
2) Have your application open the SQLite Db and do searches on the table.

**************************

I thought about that. I just decided that plain txt files are easier
(or should I say more convenient) to manage. I know I will be adding
items to them as time goes by.

And like I said, I have zero problem with the lookup time. It's working
very fast, no delay whatsoever and I'm even running two queries on most
of the words.

One reason you may want a very fast lookup is you may want to eventually also make suggestions.

One way is to take a mis-spelled word and make changes, like reverse each pair of letters, add a letter and subtract a letter at all possible locations in the word, and then lookup the newly formed words. That's where having a very fast lookup can help.

Anyway, that's how I once did a spell checker. It worked decently well.

Incidentally, using a tsv shared variable array can be loaded like so:

proc load {WordlistFile} {
set _fp [open $WordlistFile r]
set data [read -nonewline $_fp]
set Wordlist2 [split $data \n]
foreach word $Wordlist2 {
tsv::set dictionary $word 1
}
close $_fp

}

This takes only about 300ms to load my 400k dictionary. (A regular array was 200ms).

Then lookups are on the order of a few microseconds:

proc lookup {word} {
return [tsv::exists dictionary $word]
}

% lookup hello
1
% lookup hellox
0
% time {lookup hello}
8 microseconds per iteration
% time {lookup hellox}
7 microseconds per iteration

I would think with lsearch, not finding a word might take the longest, if it's doing a sequential search.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to All on Tue Dec 12 19:08:01 2023

On Tue, 12 Dec 2023 12:39:28 -0800, et99 wrote:

One reason you may want a very fast lookup is you may want to eventually
also make suggestions.

I still have to investigate what algorithms there are out there. I tried
a couple of my own and they failed quite beautifully.

Anyway, I spent some time testing my newly acquired word list by looking
up all kinds of far fetched words to see how thorough it was. Out of convenience, I did it like this:

proc p.findword {word} {
puts -nonewline [lsearch -nocase $::BIGLIST $word]
}
foreach w {word this that something some other whatever dunno} {
puts "[p.findword $w] $w"
}

I went as far as 10 words at a time, and the output is instantaneous
for all the 10 lookups. With my old algorithm to read the list file
(the one million iterations), there was a very noticeable pause before
the output, but now that is instantaneous too.

So no worries about speed yet.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to Rich on Tue Dec 12 18:59:13 2023

On Tue, 12 Dec 2023 16:41:50 -0000 (UTC), Rich wrote:

You can keep your 'plain text' file, just setup a process to
'regenerate' the sqlite database whenever you update the plain text
file.

The advantage you get with sqlite is that all the preprocessing is done
ahead of time, and you only incur the "lookup time" when you do a
lookup.

A second advantage is your word list could be much larger than what you
can hold in memory when it is in a sqlite DB (although this advantage
has shrunk given the huge amount of RAM in modern systems).

**************************

Great tips, I always learn a lot here. Thank you.

However, I don't understand what you mean by "all the preprocessing is
done ahead of time."

What preprocessing? The file is "slurped" once at launch then the word
list is permanently available in a list. Why would acess to that list
(in memory, I assume) be slower to access to a database (on disk, for
sure)?

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Wed Dec 13 00:05:41 2023

Luc <[email protected]d> wrote:

On Tue, 12 Dec 2023 16:41:50 -0000 (UTC), Rich wrote:

You can keep your 'plain text' file, just setup a process to
'regenerate' the sqlite database whenever you update the plain text
file.

The advantage you get with sqlite is that all the preprocessing is done >>ahead of time, and you only incur the "lookup time" when you do a
lookup.

A second advantage is your word list could be much larger than what you
can hold in memory when it is in a sqlite DB (although this advantage
has shrunk given the huge amount of RAM in modern systems).

**************************

Great tips, I always learn a lot here. Thank you.

However, I don't understand what you mean by "all the preprocessing is
done ahead of time."

What preprocessing?

Trimming spaces from the words only needs to be done once. Splitting
lines into words only needs to be done once.

The file is "slurped" once at launch then the word list is
permanently available in a list.

Part of your slurp involves a split on \n to convert the lines into
words. While that step, on modern CPU's, is fast enough you get the
luxury to ignore it, it is "preprocessing" that you are performing
every time you slurp the file in from disk.

Why would acess to that list (in memory, I assume) be slower to
access to a database (on disk, for sure)?

For things that fit in ram, and a list, and provided you have the list
sorted, and use the -sorted option to list, then lookups in the list
likely will beat sqlite. But, if the wordlist grows too large for
memory (this is unlikely for your specific use case, but for other
kinds of "data" is very common) or you don't keep it sorted so you have
to use lsearch's linear search then sqlite (provided you tell sqlite to
index the lookup column) will beat the list method in most cases.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to Rich on Tue Dec 12 21:36:44 2023

On Wed, 13 Dec 2023 00:05:41 -0000 (UTC), Rich wrote:

For things that fit in ram, and a list, and provided you have the list >sorted, and use the -sorted option to list, then lookups in the list
likely will beat sqlite. But, if the wordlist grows too large for
memory (this is unlikely for your specific use case, but for other
kinds of "data" is very common) or you don't keep it sorted so you have
to use lsearch's linear search then sqlite (provided you tell sqlite to
index the lookup column) will beat the list method in most cases.

**************************

Do you by any chance know what happens if I use lsearch -sorted on a
list that

A. is not perfectly or completely sorted (new items have been added to
the end)

B. I run the garden variety GNU 'sort' command on the word list file
so it may not comply exactly with whatever lsearch thinks should be
considered "sorted" (ascii, alnum, etc.)?

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to Luc on Tue Dec 12 21:51:14 2023

On Tue, 12 Dec 2023 21:36:44 -0300, Luc wrote:

Do you by any chance know what happens if I use lsearch -sorted on a
list that

A. is not perfectly or completely sorted (new items have been added to
the end)

B. I run the garden variety GNU 'sort' command on the word list file
so it may not comply exactly with whatever lsearch thinks should be >considered "sorted" (ascii, alnum, etc.)?

**************************

Whoa. I don't know what is going on, but something is going on and
it's bad.

proc p.findword {word} {
puts -nonewline [lsearch -nocase $::BIGLIST $word]
}
foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
puts "[p.findword $w] $w"
}

9 out of 10 words found.

Now, using -sorted:

proc p.findword {word} {
puts -nonewline [lsearch -nocase -sorted $::BIGLIST $word]
}
foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
puts "[p.findword $w] $w"
}

Visibly faster, but only 3 out of 10 words found.

Not good.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From et99@21:1/5 to Luc on Tue Dec 12 18:28:23 2023

On 12/12/2023 4:51 PM, Luc wrote:

On Tue, 12 Dec 2023 21:36:44 -0300, Luc wrote:

Do you by any chance know what happens if I use lsearch -sorted on a
list that

A. is not perfectly or completely sorted (new items have been added to
the end)

B. I run the garden variety GNU 'sort' command on the word list file
so it may not comply exactly with whatever lsearch thinks should be
considered "sorted" (ascii, alnum, etc.)?

**************************

Whoa. I don't know what is going on, but something is going on and
it's bad.

proc p.findword {word} {
puts -nonewline [lsearch -nocase $::BIGLIST $word]
}
foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
puts "[p.findword $w] $w"
}

9 out of 10 words found.

Now, using -sorted:

proc p.findword {word} {
puts -nonewline [lsearch -nocase -sorted $::BIGLIST $word]
}
foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
puts "[p.findword $w] $w"
}

Visibly faster, but only 3 out of 10 words found.

Not good.

What I meant by pre-processing was to take your list as cleaned up, sorted, etc. and write it out, once. Thereafter, you could use the read/split to restore it to memory quickly.

If, however, you are going to be adding words during a run, you could just keep 2 lists. The second list would likely be very short if added by the user during a session. Merging new words in might be a pain, and re-sorting the entire list likewise.

On the other hand, this is a plus for using the array, since order isn't important there, as it's just hashing them.

But are you also going to let the user do a "save dictionary" after adding in new words? Programs never do stay simple :)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Wed Dec 13 03:00:28 2023

Luc <[email protected]d> wrote:

Whoa. I don't know what is going on, but something is going on and
it's bad.

proc p.findword {word} {
puts -nonewline [lsearch -nocase $::BIGLIST $word]
}
foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
puts "[p.findword $w] $w"
}

9 out of 10 words found.

Now, using -sorted:

proc p.findword {word} {
puts -nonewline [lsearch -nocase -sorted $::BIGLIST $word]
}
foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
puts "[p.findword $w] $w"
}

When you use "-sorted", ::BIGLIST is, in fact, sorted, right?

Visibly faster, but only 3 out of 10 words found.

Not good.

Given the reduction in hits, this implies you do not have ::BIGLIST
sorted.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Wed Dec 13 02:58:43 2023

Luc <[email protected]d> wrote:

On Wed, 13 Dec 2023 00:05:41 -0000 (UTC), Rich wrote:

For things that fit in ram, and a list, and provided you have the list >>sorted, and use the -sorted option to list, then lookups in the list
likely will beat sqlite. But, if the wordlist grows too large for
memory (this is unlikely for your specific use case, but for other
kinds of "data" is very common) or you don't keep it sorted so you have
to use lsearch's linear search then sqlite (provided you tell sqlite to >>index the lookup column) will beat the list method in most cases.

**************************

Do you by any chance know what happens if I use lsearch -sorted on a
list that

A. is not perfectly or completely sorted (new items have been added to
the end)

Most likely the search will complete, but it may or may not find the
requested element.

Try creating a random ordered list from your long word list, then do a
lsearch -sorted on it and see what happens. ::struct::list shuffle
from Tcllib can randomly order your word list for you.

B. I run the garden variety GNU 'sort' command on the word list file
so it may not comply exactly with whatever lsearch thinks should be considered "sorted" (ascii, alnum, etc.)?

The manpage implies that -sorted expects the list to be sorted via the
-ascii comparison operator of lsort. Since this is unicode code point
order, it likely exactly matches that which gnu sort produces in its
default state.

But there are only two possibilities:

1) gnu sort sorts in the order that lsearch assumes for -sorted - in
which case you get back correct answers

2) gnu sort sorts slightly differently -- in which case you may get
back any of:

no answers at all (i.e. always a 'not found')

the wrong answers (although it is much more likely to simply return
"not found")

sometimes the right answer, sometimes a false "not found"

This third one would occur when the list is almost in the right order,
but only a few elements differ. If what you search for ends up being
in a segmennt that is ordered as expected, you get the right answer (or
a true "not found" answer). If what you search for ends up being in a
segment that is not ordered as expected, you most likely get back a
false not found.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to Rich on Wed Dec 13 08:54:35 2023

On Wed, 13 Dec 2023 03:00:28 -0000 (UTC), Rich wrote:

When you use "-sorted", ::BIGLIST is, in fact, sorted, right?

Visibly faster, but only 3 out of 10 words found.

Not good.

Given the reduction in hits, this implies you do not have ::BIGLIST
sorted.

**************************

::BIGLIST is slurped straight from the file which was a merge of multiple
word lists and dictionaries I found here and there, then sorted with
sort -u to remove the duplicates.

So it is sorted, but I guess it's not sorted in the way that lsearch
expects.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to All on Wed Dec 13 09:10:38 2023

On Tue, 12 Dec 2023 18:28:23 -0800, et99 wrote:

What I meant by pre-processing was to take your list as cleaned up,
sorted, etc. and write it out, once. Thereafter, you could use the
read/split to restore it to memory quickly.

If, however, you are going to be adding words during a run, you could just >keep 2 lists. The second list would likely be very short if added by the
user during a session. Merging new words in might be a pain, and
re-sorting the entire list likewise.

On the other hand, this is a plus for using the array, since order isn't >important there, as it's just hashing them.

But are you also going to let the user do a "save dictionary" after adding
in new words? Programs never do stay simple :)

**************************

Well, yes. It's done and in production already.

You see, names are simple. They have to begin with a capital letter.

But "begin" means it can be either Mary or MARY. For that I need some
kind of -nocase parameter or one normalization step plus a second lookup.
That may or may not defeat the superior speed of array lookups or more
likely just make the difference less meaningful.

Common words are less simple. In the beginning of a sentence, they must
begin with a capital letter. In the middle of a sentence, they must
begin with a small letter. But in either case it may be all upper case
too.

The shortest route I could think of was two lists: things and names.
1. Search in the first list with no case and that's it.
2. Not found? Search in the second list as is and that's it.
3. Still not found? Capitalize it and look for it again in the list
of names.

In case you're wondering, the problem of capitalizing words (or not)
according to punctuation is taken care of by a completely different proc
that does auto correct according to another list. I actually use the
concept of auto correct to auto expand abbreviations and type faster.
That proc takes care of capitalization according to punctuation.
In a public application that would not be good enough, but since this
is for private use and is working as intended, I won't bother fixing
what ain't broken.

But another problem comes up.

In my current design, boxes with any problem cannot be approved and I am
not allowed to jump to the next one until the problem is properly fixed.
A "problem" currently means too many characters or an empty box. Empty
boxes may be desirable in certain circumstances so there is a "force"
command (and key shortcut) in case I want to override it. Misspellings
will just be a third kind of problem.

Workflow speed is always a priority with this thing so I implemented the possibility of a double override action. The first override key press
will add all unknown words to the word list and the second override will "approve and move forward."

But then I can't distinguish things from names. I can, but I guess I
would have to introduce a pop-up to decide which one every time. That
would slow things down. I though that maybe it would be better to just
use one global word list and take care of casing with my own human proofreading.

Then again, unknown words are highly likely to be proper names so I
decided to detect their case and send them straight to the names list
if they are written with a capital letter whether it's a name or not.
If they are not a name and happen to show up again in small letters,
then I will add them again, in which case they will go to the word list.

Now words or names are always added twice: to the list in memory and
appended to the file on disk.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to Luc on Wed Dec 13 09:15:43 2023

On Wed, 13 Dec 2023 08:54:35 -0300, Luc wrote:

::BIGLIST is slurped straight from the file which was a merge of multiple >word lists and dictionaries I found here and there, then sorted with
sort -u to remove the duplicates.

So it is sorted, but I guess it's not sorted in the way that lsearch
expects.

**************************

Well, I added an lsort step to the file slurp procedure and now using
lsort -nocase -sorted yields all the expected search hits.

And everything is still very fast.

So I guess that much is taken care of.

Thank you again.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Wed Dec 13 14:26:29 2023

Luc <[email protected]d> wrote:

On Wed, 13 Dec 2023 03:00:28 -0000 (UTC), Rich wrote:

When you use "-sorted", ::BIGLIST is, in fact, sorted, right?

Visibly faster, but only 3 out of 10 words found.

Not good.

Given the reduction in hits, this implies you do not have ::BIGLIST
sorted.

**************************

::BIGLIST is slurped straight from the file which was a merge of multiple word lists and dictionaries I found here and there, then sorted with
sort -u to remove the duplicates.

So it is sorted, but I guess it's not sorted in the way that lsearch
expects.

That would be my guess.

Slurp it in, sort it with lsort, then save the result to a second file
(note you'll want to use [join] and join with \n when saving to a
second file.

Then run "diff -u" on both files (and being 1m lines, pipe the result
to less) and see if there is a difference, and what is different.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Wed Dec 13 14:53:37 2023

Luc <[email protected]d> wrote:

On Wed, 13 Dec 2023 08:54:35 -0300, Luc wrote:

::BIGLIST is slurped straight from the file which was a merge of multiple >>word lists and dictionaries I found here and there, then sorted with
sort -u to remove the duplicates.

So it is sorted, but I guess it's not sorted in the way that lsearch >>expects.

**************************

Well, I added an lsort step to the file slurp procedure and now using
lsort -nocase -sorted yields all the expected search hits.

Do note that if you lsort before saving, then you don't have to lsort
upon loading until such time as you add more words.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From et99@21:1/5 to Luc on Wed Dec 13 13:33:45 2023

On 12/13/2023 4:10 AM, Luc wrote:

On Tue, 12 Dec 2023 18:28:23 -0800, et99 wrote:

What I meant by pre-processing was to take your list as cleaned up,
sorted, etc. and write it out, once. Thereafter, you could use the
read/split to restore it to memory quickly.

If, however, you are going to be adding words during a run, you could just >> keep 2 lists. The second list would likely be very short if added by the
user during a session. Merging new words in might be a pain, and
re-sorting the entire list likewise.

On the other hand, this is a plus for using the array, since order isn't
important there, as it's just hashing them.

But are you also going to let the user do a "save dictionary" after adding >> in new words? Programs never do stay simple :)

**************************

Well, yes. It's done and in production already.

You see, names are simple. They have to begin with a capital letter.

But "begin" means it can be either Mary or MARY. For that I need some
kind of -nocase parameter or one normalization step plus a second lookup. That may or may not defeat the superior speed of array lookups or more
likely just make the difference less meaningful.

Common words are less simple. In the beginning of a sentence, they must
begin with a capital letter. In the middle of a sentence, they must
begin with a small letter. But in either case it may be all upper case
too.

The shortest route I could think of was two lists: things and names.
1. Search in the first list with no case and that's it.
2. Not found? Search in the second list as is and that's it.
3. Still not found? Capitalize it and look for it again in the list
of names.

In case you're wondering, the problem of capitalizing words (or not) according to punctuation is taken care of by a completely different proc
that does auto correct according to another list. I actually use the
concept of auto correct to auto expand abbreviations and type faster.
That proc takes care of capitalization according to punctuation.
In a public application that would not be good enough, but since this
is for private use and is working as intended, I won't bother fixing
what ain't broken.

But another problem comes up.

In my current design, boxes with any problem cannot be approved and I am
not allowed to jump to the next one until the problem is properly fixed.
A "problem" currently means too many characters or an empty box. Empty
boxes may be desirable in certain circumstances so there is a "force"
command (and key shortcut) in case I want to override it. Misspellings
will just be a third kind of problem.

Workflow speed is always a priority with this thing so I implemented the possibility of a double override action. The first override key press
will add all unknown words to the word list and the second override will "approve and move forward."

But then I can't distinguish things from names. I can, but I guess I
would have to introduce a pop-up to decide which one every time. That
would slow things down. I though that maybe it would be better to just
use one global word list and take care of casing with my own human proofreading.

Then again, unknown words are highly likely to be proper names so I
decided to detect their case and send them straight to the names list
if they are written with a capital letter whether it's a name or not.
If they are not a name and happen to show up again in small letters,
then I will add them again, in which case they will go to the word list.

Now words or names are always added twice: to the list in memory and
appended to the file on disk.

On your 3 step approach:

What about words that can be both, like Cat Stevens and I have a cat; Drew and drew a picture. Will you accept the user's case choice in that situation?

I see this as 2 different problems. Checking spelling and checking capitalization. You mention you already have separate procs for this.

As to sorting your list(s). What is the purpose other than to make lsearch lookup faster? As to doing a -nocase search, how is that any different than maintaining your list(s) in a single case and converting to that case before lookup?

So, two choices, use an array or use a list.

Here's my check list:

List:

Must be kept sorted for speed.
Adding words needs to re-sort or merge.
Duplicates could be a problem (but I'm not sure)
Can have more than 1 list

Array:

No need to sort for speed
Need to keep in one case (say lower) and convert to (lower) before lookup Adding words is no problem, just make it lowercase first
No problems with duplicates, adding the same word is a no-op
Can have multiple arrays

As to the need to have an in memory plus a disk file would mean:

List:

Have to resort when adding new words

Array:

Just add any new word to the end of the file. No need to sort.

As to loading up with a separate thread. The tsv variables are arrays. A 2nd thread can load up a tsv array from a file and the main thread can see that shared array without needing to do lookups in the second thread (as I mistakenly wrote in a prior
post).

My timings for tsv arrays with [tsv::exists dictionary $word] show it's just as fast as a non-shared array.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to All on Wed Dec 13 19:33:27 2023

On Wed, 13 Dec 2023 13:33:45 -0800, et99 wrote:

On your 3 step approach:

What about words that can be both, like Cat Stevens and I have a cat; Drew >and drew a picture. Will you accept the user's case choice in that
situation?

First, the user (aka "yours truly") gets a warning that a certain word is unknown. The user decides whether that word really is a typo or a new word
that should be added to the spelling dictionary. It's a human in control.
Which is good.

My word list is very thourough. I had a little fun looking for weird words
and only two or three were missing and all of them were slang. I had to
cheat a little to find something missing. So what kind of words will be
flagged as unknown although being correct? Names. I reckon about 99% of
the times.

Remember, I want to do things fast for optimum productivity. Being able
to press one key and add the word to its proper list automatically is considered valuable. If the word begins with capital case, it's even more likely to be a name. The odds of an unknown word that is not a name being
the first one in a sentence are very slim. But if it doesn't have a
capital letter then it has to be a word.

So maybe it is not a name. So maybe it was the first word in a sentence.
Well, it's a very rare exception and I don't mind adding it to the list
of names. The list won't be "tarnished." I don't care. The worst that
can happen is it shows up again (hopefully in small letter this time) and
I have to add it again. No big deal. It's a very rare occurrence.

The separation of things and names is good because it won't flag "Mary,"
but it will flag "mary." I do want that feature. The only problem is it
will not flag "john" (in case it refers to the restroom), but oh well,
this is a typical corner case and no spelling checking system is perfect.
I know I have worked with a lot of them. They always need human
supervision.

So, two choices, use an array or use a list.

I will investigate the choice of array, but later. I have other
priorities right now. Like I said, I don't have a problem with lookup
speed. I had a problem with loading the list off a file at program
launch, but your code really solved that one. :-)

Threads are going to the back burner too. I don't need them now. I
copied and saved all the advice that wasshared here. I will have another
look at it eventually. I want to learn it.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Rixter
  Wed Jul 29 02:00:40 2026
  from Madison, Nc via Telnet
- Centurion
  Tue Jul 28 22:54:59 2026
  from Berea, Ohio via Telnet
- Bob Worm
  Tue Jul 28 16:01:18 2026
  from Wales, Uk via Telnet
- Rixter
  Tue Jul 28 13:42:46 2026
  from Madison, Nc via Telnet
- Krenn
  Tue Jul 28 11:59:57 2026
  from Sydney, Nsw via Telnet
- Rixter
  Tue Jul 28 01:23:48 2026
  from Madison, Nc via Telnet
- Centurion
  Mon Jul 27 22:50:42 2026
  from Berea, Ohio via Telnet
- Ataricrypt
  Mon Jul 27 19:19:17 2026
  from England via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	59:37:05
Calls:	12,446
Calls today:	1
Files:	15,192
Messages:	6,537,434

Retrieving data from a thread

Who's Online

Recent Visitors

System Info