Hi.
My application used to take 7 seconds to load. A little long.
Now I am trying to implement some kind of spell checker so I just added
code to load and process a file.
set ::WordlistFile /home/tcl/longlist.txt
set ::Wordlist ""
set _fp [open $::WordlistFile r]
while {![eof $_fp]} {
    set _line [string trim [gets $_fp]]
    if {$_line == ""} {continue}
    lappend ::Wordlist $_line
}
close $_fp
Now it takes 10 seconds to load. Grumble.
I thought about using threads to load the word list and make it all
load faster.
In fact, it already loads another list. Maybe I could "threadify" both?
Anyway,
I've been reading about threads and I must say the existing documentation
is not very easy to understand. Code examples on Google are pretty scarce
too. I found an interesting one on StackOverflow where Donal (DKF)
suggests using pools, but I couldn't make that work. I mean, I have to
"get" data with tpool::get? But when? How do I know when the job is done?
He also suggests tsv, but I found the relevant documentation hard to read
and understand.
Yet I've made prototypes that almost worked. Almost.
The last mile I need, I think, is retrieving data from the thread. More specifically, ::Wordlist after it is built.
I wish I didn't have to call it explicitly. Just let the thread build
and set ::Wordlist whenever it feels ready. My goal is to let the thread
load the list while the rest of the application loads other things.
Of course, it must not take longer than a few seconds. It has to be done
when I begin to type.
Among so many attempts, I only came up with one that worked. But
the application wouldn't load any faster. I must have done something
wrong. Can you please enlighten me?
package require Thread
set thr [thread::create]
# [list] keeps the command intact even if the path contains spaces
thread::send $thr [list set wordlistfile $::WordlistFile]
thread::send -async $thr {
    set _fp [open $wordlistfile r]
    while {![eof $_fp]} {
        set _line [string trim [gets $_fp]]
        if {$_line == ""} {continue}
        lappend ::Wordlist $_line
    }
    close $_fp
}
puts "Now what?"
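A minimal sketch of one way to get the list back into the main thread, assuming the Thread package's optional result-variable argument to `thread::send -async`. The proc name `loadWordlistAsync` is made up for illustration, and the file is read with the read/split approach rather than a `gets` loop. Note that in the attempt above, `::Wordlist` is set inside the worker thread's own interpreter, so the main thread never sees it.

```tcl
package require Thread

# Hypothetical helper: read a word file in a worker thread and store the
# resulting list in the variable named by varName in the calling thread.
proc loadWordlistAsync {path varName} {
    set worker [thread::create]
    # Build the script with [list] so a path containing spaces survives.
    set script [list apply {{path} {
        set fp [open $path r]
        set data [read -nonewline $fp]
        close $fp
        # The value returned here becomes the result of the script.
        return [split $data \n]
    }} $path]
    # With -async plus a variable name, thread::send returns at once and
    # the result is written to that variable once the worker has finished
    # and the calling thread's event loop gets a chance to run.
    thread::send -async $worker $script $varName
    return $worker
}
```

Usage would look like `set worker [loadWordlistAsync $::WordlistFile ::Wordlist]`, followed by the rest of the startup code. In a Tk application the event loop is already running, so `::Wordlist` should simply appear when the worker is done; a `trace add variable ::Wordlist write ...` callback can signal that moment. Without an event loop, `vwait ::Wordlist` blocks until the result arrives. As far as I know the result crosses the thread boundary as a string, so part of the list-building cost may still land on the main thread; whether this is actually faster than a plain read/split would need to be measured.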
What is /home/tcl/longlist.txt?
Are these the words to look up, or your spelling dictionary? I hate to
assume. Based on your variable names, I can't tell which it is.
But... I have a 400k word dictionary I got online somewhere. It's one word
per line. No spaces, nothing to trim and no blank lines to remove. When I
timed your code against a simpler version,
set _fp [open $::WordlistFile r]
set data [read -nonewline $_fp]
set Wordlist [split $data \n]
close $_fp
It went from 700ms to 50ms to create Wordlist. But again, not knowing for
sure what that file really is....
For example, if it is your spelling dictionary, I would preprocess it so
the trim and blank line tests wouldn't be needed, and then I would store
it in a Tcl array as a hash table where an
[info exists dictionary($someword)] could be used to check spelling.
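The array idea can be sketched as follows. The array name `dictionary` comes from the post; the proc names `buildDictionary` and `isKnownWord` are invented here for illustration.

```tcl
# Build the lookup table once; each word becomes an array key.
proc buildDictionary {words} {
    global dictionary
    array unset dictionary
    foreach w $words {
        set dictionary($w) 1
    }
}

# A hash lookup: the cost does not depend on where the word would sit
# in a list. The match is exact, so case matters here.
proc isKnownWord {word} {
    global dictionary
    return [info exists dictionary($word)]
}

buildDictionary {apple banana cherry}
puts [isKnownWord banana]   ;# prints 1
puts [isKnownWord durian]   ;# prints 0
```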
As to threads, I would *highly* recommend you get a copy of Ashok's book,
The Tcl Programming Language. It has a very good section on threads (and
many other topics).
See here: https://www.magicsplat.com/
I purchased both the paper book and the pdf version (great for searching).
On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
> What is /home/tcl/longlist.txt
> Are these the words to lookup or your spelling dictionary. I hate to
> assume. Based on your variable names, I can't tell which it is.
> But... I have a 400k word dictionary I got online somewhere. It's one word
> per line. No spaces, nothing to trim and no blank lines to remove. When I
> timed your code against a simpler,
> set _fp [open $::WordlistFile r]
> set data [read -nonewline $_fp]
> set Wordlist [split $data \n]
> close $_fp
> It went from 700ms to 50ms to create Wordlist. But again, not knowing for
> sure what that file really is....
> For example, if it is your spelling dictionary, I would preprocess it so
> the trim and blank line tests wouldn't be needed, and then I would store
> it in a tcl array as a hash table where an [info exist
> dictionary($someword)] could be used to check spelling.
> As to threads, I would *highly* recommend you get a copy of Ashok's book,
> The tcl programming language. It has a very good section on threads (and
> many other topics).
> See here: https://www.magicsplat.com/
> I purchased both the paper book and the pdf version (great for searching).
My wordlist file has 1,248,300 lines. Each line is a word, yes.
I am currently using lsearch -nocase for the lookups. Do you know for
a fact that searching an array is faster than searching a list?
I'm not confident enough in my own methods to measure these things.
You see, case will matter if I use [info exists dictionary($someword)]. Handling case in that scenario will also add overhead.
(I just realized I will have to split my word list in two, common words
and proper names, because proper names must not be all lowercase.)
But the lookup is no problem. That is fast enough. I spell check every
word in a sentence (maximum 80 characters) in one fell swoop and it
never feels slow at all.
No corrections or any kind of guessing though, just checking whether
the words exist or not. The correction suggestion part is currently
in the R&D stage. Well, just R, no D yet.
The bottleneck is definitely in loading the word list.
I did a test without trim and found that the words are not found and a
false misspell is flagged for everything. I soon realized that the
newlines become part of each word so the upshot is they invalidate the
entire dataset. I really have to axe them.
But I will try your file reading code and see if it's faster.
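If the trimming and blank-line checks turn out to be necessary, they can be applied after the file has been slurped instead of inside a `gets` loop. A sketch, with the hypothetical proc name `readWordlist`; it has not been timed against the 1.2 million line file:

```tcl
# Read the whole file at once, then clean up each line in memory.
proc readWordlist {path} {
    set fp [open $path r]
    set data [read -nonewline $fp]
    close $fp
    set words {}
    foreach line [split $data \n] {
        # string trim removes surrounding whitespace, including a stray \r
        set line [string trim $line]
        if {$line ne ""} {
            lappend words $line
        }
    }
    return $words
}
```

Since `split $data \n` already removes the newlines, words read this way should not carry a trailing newline; stray carriage returns or spaces in the file are a more likely cause of failed lookups, and the `string trim` call covers those.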
Either way, I would like the opportunity to learn about threads.
I really can't afford any book right now. I don't want to go into
details, suffice to say that I am in very very bad financial condition
right now. Like, really, no joke.
The last mile I need, I think, is retrieving data from the thread.
More specifically, ::Wordlist after it is built.
On 12/11/23 19:56, Luc wrote:
> On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
> ...
> My wordlist file has 1,248,300 lines. Each line is a word, yes.
> I am currently using lsearch -nocase for the lookups. Do you know for
> a fact that searching an array is faster than searching a list?
For a large list, yes!
That all being said, you may want to step back and consider alternatives.
Suggestion: use SQLite...
1) Build "offline" (i.e. before you run your application) a SQLite DB
with a table that has a single column, one word per row.
2) Have your application open the SQLite DB and do searches on the table.
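The two steps could look roughly like this with the `sqlite3` Tcl package. The table name `words`, the column name `word`, and the proc names are assumptions for illustration, as is the choice of `COLLATE NOCASE` to get case-insensitive matching for ASCII letters.

```tcl
package require sqlite3

# Step 1 (offline): create the database from a list of words.
proc buildWordDb {dbPath words} {
    sqlite3 wdb $dbPath
    # PRIMARY KEY gives the column an index, so lookups avoid a full scan.
    wdb eval {
        CREATE TABLE IF NOT EXISTS words(word TEXT PRIMARY KEY COLLATE NOCASE)
    }
    # One transaction for all inserts is much faster than one per row.
    wdb transaction {
        foreach w $words {
            wdb eval {INSERT OR IGNORE INTO words(word) VALUES($w)}
        }
    }
    wdb close
}

# Step 2 (in the application): check a word against an open database.
proc wordInDb {db word} {
    return [$db exists {SELECT 1 FROM words WHERE word = $word}]
}
```

In the application, `sqlite3 db words.db` followed by `wordInDb db $someword` would do the lookup. With `COLLATE NOCASE` on a primary key, "Cat" and "cat" count as the same row, so this particular schema would not suit a list where case distinguishes entries.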
On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
> But... I have a 400k word dictionary I got online somewhere. It's one word
> per line. No spaces, nothing to trim and no blank lines to remove. When I
> timed your code against a simpler,
> set _fp [open $::WordlistFile r]
> set data [read -nonewline $_fp]
> set Wordlist [split $data \n]
> close $_fp
Well, I can say this, your code is many times as fast as mine.
Very noticeable difference.
I will be using that approach from now on.
On Tue, 12 Dec 2023 07:51:30 -0600, Gerald Lester wrote:
> On 12/11/23 19:56, Luc wrote:
>> On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
>> ...
>> My wordlist file has 1,248,300 lines. Each line is a word, yes.
>> I am currently using lsearch -nocase for the lookups. Do you know for
>> a fact that searching an array is faster than searching a list?
> For a large list yes!
> That all being said, you may want to step back and consider alternatives.
> Suggestion: use SQLite...
> 1) Build "offline" (i.e. before you run your application) a SQLite DB
> with a table that has one column each row one of your words.
> 2) Have your application open the SQLite Db and do searches on the table.
I thought about that. I just decided that plain txt files are easier
(or should I say more convenient) to manage. I know I will be adding
items to them as time goes by.
And like I said, I have zero problem with the lookup time. It's working
very fast, no delay whatsoever and I'm even running two queries on most
of the words.
> I am currently using lsearch -nocase for the lookups. Do you know for
> a fact that searching an array is faster than searching a list?
> I'm not confident enough in my own methods to measure these things.
I would think with lsearch, not finding a word might take the
longest, if it's doing a sequential search.
et99 wrote:
> I would think with lsearch, not finding a word might take the
> longest, if it's doing a sequential search.
For the default, yes, because the default is a sequential search.
But, if your list elements are sorted, you can use the "-sorted"
option, which speeds it up. The man page simply says it "will use a more
efficient searching algorithm to search list". I suspect "-sorted"
turns on a binary search of the list elements.
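For anyone who wants to measure this rather than guess, Tcl's built-in `time` command is enough. The sketch below uses synthetic data (200,000 made-up entries named `w0`, `w1`, ...), not the word list from this thread, and probes for a word that is absent, which is the worst case for a sequential search. The printed timings will vary by machine, so none are shown.

```tcl
# Synthetic data: 200000 made-up "words".
set words {}
for {set i 0} {$i < 200000} {incr i} {
    lappend words w$i
}
set sorted [lsort $words]          ;# default (ASCII) order for -sorted
foreach w $words {
    set table($w) 1                ;# array used as a hash table
}

set probe w199999x                 ;# hypothetical word, not in the data

# Each line reports the average microseconds per iteration over 20 runs.
puts "linear search : [time {lsearch -exact $words $probe} 20]"
puts "sorted search : [time {lsearch -sorted $sorted $probe} 20]"
puts "array lookup  : [time {info exists table($probe)} 20]"
```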
> On Tue, 12 Dec 2023 07:51:30 -0600, Gerald Lester wrote:
>> On 12/11/23 19:56, Luc wrote:
>>> On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
>>> ...
>>> My wordlist file has 1,248,300 lines. Each line is a word, yes.
>>> I am currently using lsearch -nocase for the lookups. Do you know for
>>> a fact that searching an array is faster than searching a list?
>> For a large list yes!
>> That all being said, you may want to step back and consider alternatives.
>> Suggestion: use SQLite...
>> 1) Build "offline" (i.e. before you run your application) a SQLite DB
>> with a table that has one column each row one of your words.
>> 2) Have your application open the SQLite Db and do searches on the table.
> I thought about that. I just decided that plain txt files are easier
> (or should I say more convenient) to manage. I know I will be adding
> items to them as time goes by.
> And like I said, I have zero problem with the lookup time. It's working
> very fast, no delay whatsoever and I'm even running two queries on most
> of the words.
One reason you may want a very fast lookup is you may want to eventually
also make suggestions.
You can keep your 'plain text' file, just set up a process to
'regenerate' the sqlite database whenever you update the plain text
file.
The advantage you get with sqlite is that all the preprocessing is done
ahead of time, and you only incur the "lookup time" when you do a
lookup.
A second advantage is your word list could be much larger than what you
can hold in memory when it is in a sqlite DB (although this advantage
has shrunk given the huge amount of RAM in modern systems).
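One way the "regenerate when the text file changes" step might be automated is to compare modification times at startup. This is only a sketch: the proc name `refreshWordDb` and the table layout are assumptions, and `file mtime` has one-second resolution, so an edit made within the same second as the last rebuild would be missed.

```tcl
package require sqlite3

# Hypothetical helper: rebuild dbPath from txtPath when the text file is
# newer than the database. Returns 1 if a rebuild happened, 0 otherwise.
proc refreshWordDb {txtPath dbPath} {
    if {[file exists $dbPath]
            && [file mtime $dbPath] >= [file mtime $txtPath]} {
        return 0   ;# database is up to date, nothing to do
    }
    set fp [open $txtPath r]
    set words [split [read -nonewline $fp] \n]
    close $fp
    # Start from scratch so removed words disappear from the database too.
    file delete $dbPath
    sqlite3 regen $dbPath
    regen eval {CREATE TABLE words(word TEXT PRIMARY KEY)}
    regen transaction {
        foreach w $words {
            regen eval {INSERT OR IGNORE INTO words(word) VALUES($w)}
        }
    }
    regen close
    return 1
}
```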
On Tue, 12 Dec 2023 16:41:50 -0000 (UTC), Rich wrote:
> You can keep your 'plain text' file, just setup a process to
> 'regenerate' the sqlite database whenever you update the plain text
> file.
> The advantage you get with sqlite is that all the preprocessing is done
> ahead of time, and you only incur the "lookup time" when you do a
> lookup.
> A second advantage is your word list could be much larger than what you
> can hold in memory when it is in a sqlite DB (although this advantage
> has shrunk given the huge amount of RAM in modern systems).
Great tips, I always learn a lot here. Thank you.
However, I don't understand what you mean by "all the preprocessing is
done ahead of time."
What preprocessing?
The file is "slurped" once at launch then the word list is
permanently available in a list.
Why would access to that list (in memory, I assume) be slower than
access to a database (on disk, for sure)?
For things that fit in RAM and in a list, and provided you keep the list
sorted and use the -sorted option to lsearch, lookups in the list
likely will beat sqlite. But if the wordlist grows too large for
memory (this is unlikely for your specific use case, but for other
kinds of "data" is very common), or you don't keep it sorted so you have
to use lsearch's linear search, then sqlite (provided you tell sqlite to
index the lookup column) will beat the list method in most cases.
Do you by any chance know what happens if I use lsearch -sorted on a
list that
A. is not perfectly or completely sorted (new items have been added to
the end)
B. I run the garden variety GNU 'sort' command on the word list file
so it may not comply exactly with whatever lsearch thinks should be
considered "sorted" (ascii, alnum, etc.)?
On Tue, 12 Dec 2023 21:36:44 -0300, Luc wrote:
> Do you by any chance know what happens if I use lsearch -sorted on a
> list that
> A. is not perfectly or completely sorted (new items have been added to
> the end)
> B. I run the garden variety GNU 'sort' command on the word list file
> so it may not comply exactly with whatever lsearch thinks should be
> considered "sorted" (ascii, alnum, etc.)?
Whoa. I don't know what is going on, but something is going on and
it's bad.
proc p.findword {word} {
    # return the index instead of printing it as a side effect
    return [lsearch -nocase $::BIGLIST $word]
}
foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
    puts "[p.findword $w] $w"
}
9 out of 10 words found.
Now, using -sorted:
proc p.findword {word} {
    return [lsearch -nocase -sorted $::BIGLIST $word]
}
foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
    puts "[p.findword $w] $w"
}
Visibly faster, but only 3 out of 10 words found.
Not good.
> On Wed, 13 Dec 2023 00:05:41 -0000 (UTC), Rich wrote:
>> For things that fit in ram, and a list, and provided you have the list
>> sorted, and use the -sorted option to list, then lookups in the list
>> likely will beat sqlite. But, if the wordlist grows too large for
>> memory (this is unlikely for your specific use case, but for other
>> kinds of "data" is very common) or you don't keep it sorted so you have
>> to use lsearch's linear search then sqlite (provided you tell sqlite to
>> index the lookup column) will beat the list method in most cases.
> Do you by any chance know what happens if I use lsearch -sorted on a
> list that
> A. is not perfectly or completely sorted (new items have been added to
> the end)
> B. I run the garden variety GNU 'sort' command on the word list file
> so it may not comply exactly with whatever lsearch thinks should be
> considered "sorted" (ascii, alnum, etc.)?
When you use "-sorted", ::BIGLIST is, in fact, sorted, right?
> Visibly faster, but only 3 out of 10 words found.
> Not good.
Given the reduction in hits, this implies you do not have ::BIGLIST
sorted.
What I meant by pre-processing was to take your list as cleaned up,
sorted, etc. and write it out, once. Thereafter, you could use the
read/split to restore it to memory quickly.
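The one-time preprocessing pass described here might look like the sketch below. The proc name `preprocessWordlist` is invented, and the choice of case-insensitive order is an assumption that should match whatever options the later `lsearch -sorted` calls use.

```tcl
# Clean a raw word file once and write out a version that can be loaded
# with a plain read/split. Returns the number of words written.
proc preprocessWordlist {inPath outPath} {
    set fp [open $inPath r]
    set raw [split [read $fp] \n]
    close $fp
    set clean {}
    foreach line $raw {
        set line [string trim $line]
        if {$line ne ""} {
            lappend clean $line
        }
    }
    # Drop exact duplicates first (this keeps both "Cat" and "cat"),
    # then order the result case-insensitively for lsearch -sorted -nocase.
    set clean [lsort -nocase [lsort -unique $clean]]
    set fp [open $outPath w]
    puts $fp [join $clean \n]
    close $fp
    return [llength $clean]
}
```

After this has been run once, the application only needs the four-line read/split loader, with no trimming, blank-line checks, or sorting at startup.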
If, however, you are going to be adding words during a run, you could just
keep 2 lists. The second list would likely be very short if added by the
user during a session. Merging new words in might be a pain, and
re-sorting the entire list likewise.
On the other hand, this is a plus for using the array, since order isn't
important there, as it's just hashing them.
But are you also going to let the user do a "save dictionary" after adding
in new words? Programs never do stay simple :)
On Wed, 13 Dec 2023 03:00:28 -0000 (UTC), Rich wrote:
> When you use "-sorted", ::BIGLIST is, in fact, sorted, right?
>> Visibly faster, but only 3 out of 10 words found.
>> Not good.
> Given the reduction in hits, this implies you do not have ::BIGLIST
> sorted.
::BIGLIST is slurped straight from the file which was a merge of multiple
word lists and dictionaries I found here and there, then sorted with
sort -u to remove the duplicates.
So it is sorted, but I guess it's not sorted in the way that lsearch
expects.
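A quick way to test that suspicion is to ask Tcl whether the list is already in the order it would produce itself. GNU `sort` typically follows the current locale's collation rules, which need not agree with Tcl's comparison. The sketch below (proc name `isSortedNocase` is made up) relies on `lsort` being a stable sort, so an already sorted list comes back unchanged; it assumes the list is meant for `lsearch -sorted -nocase`.

```tcl
# Returns 1 if $words is already in case-insensitive ASCII order,
# i.e. the order produced by "lsort -nocase", and 0 otherwise.
proc isSortedNocase {words} {
    return [expr {$words eq [lsort -nocase $words]}]
}

puts [isSortedNocase {apple Banana cherry}]   ;# prints 1
puts [isSortedNocase {Banana apple cherry}]   ;# prints 0
```

The second list is in plain ASCII order (capitals before lowercase), which is fine for `lsearch -sorted` without `-nocase` but not with it. The sort options and the search options have to agree.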
On Wed, 13 Dec 2023 08:54:35 -0300, Luc wrote:
> ::BIGLIST is slurped straight from the file which was a merge of multiple
> word lists and dictionaries I found here and there, then sorted with
> sort -u to remove the duplicates.
> So it is sorted, but I guess it's not sorted in the way that lsearch
> expects.
Well, I added an lsort step to the file slurp procedure and now using
lsearch -nocase -sorted yields all the expected search hits.
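In sketch form, the working combination pairs `lsort -nocase` at load time with `lsearch -sorted -nocase` at lookup time. The proc name `findWord` and the sample words are illustrative only.

```tcl
# Sort once, with the same comparison rule the searches will use.
set words [lsort -nocase {pear Apple cherry banana}]

# Binary search; returns the index of the word, or -1 if it is absent.
proc findWord {sortedWords word} {
    return [lsearch -sorted -nocase $sortedWords $word]
}

puts [findWord $words APPLE]   ;# prints 0
puts [findWord $words kiwi]    ;# prints -1
```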
On Tue, 12 Dec 2023 18:28:23 -0800, et99 wrote:
> What I meant by pre-processing was to take your list as cleaned up,
> sorted, etc. and write it out, once. Thereafter, you could use the
> read/split to restore it to memory quickly.
> If, however, you are going to be adding words during a run, you could just
> keep 2 lists. The second list would likely be very short if added by the
> user during a session. Merging new words in might be a pain, and
> re-sorting the entire list likewise.
> On the other hand, this is a plus for using the array, since order isn't
> important there, as it's just hashing them.
> But are you also going to let the user do a "save dictionary" after adding
> in new words? Programs never do stay simple :)
Well, yes. It's done and in production already.
You see, names are simple. They have to begin with a capital letter.
But "begin" means it can be either Mary or MARY. For that I need some
kind of -nocase parameter or one normalization step plus a second lookup.
That may or may not defeat the superior speed of array lookups or more
likely just make the difference less meaningful.
Common words are less simple. In the beginning of a sentence, they must
begin with a capital letter. In the middle of a sentence, they must
begin with a small letter. But in either case it may be all upper case
too.
The shortest route I could think of was two lists: things and names.
1. Search in the first list with no case and that's it.
2. Not found? Search in the second list as is and that's it.
3. Still not found? Capitalize it and look for it again in the list
of names.
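The three steps above can be sketched as a single proc. The names `isSpelledOk`, `things` and `names` are illustrative, and two assumptions are baked in: `things` was sorted with `lsort -nocase`, and `names` with plain `lsort`. The guard on step 3 (only retry when the word starts with a capital) is one reading of the "Mary or MARY" rule, not something the post spells out.

```tcl
# things: common words, sorted with lsort -nocase (assumption)
# names : proper names, sorted with plain lsort (assumption)
proc isSpelledOk {word things names} {
    # 1. Common words: any casing is accepted.
    if {[lsearch -sorted -nocase $things $word] >= 0} {
        return 1
    }
    # 2. Names: must match exactly as typed.
    if {[lsearch -sorted $names $word] >= 0} {
        return 1
    }
    # 3. Retry as "Capitalized" so that MARY matches Mary, but only if
    #    the typed word starts with an uppercase letter.
    if {[string is upper -strict [string index $word 0]]} {
        set cap [string totitle $word]
        return [expr {[lsearch -sorted $names $cap] >= 0}]
    }
    return 0
}
```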
In case you're wondering, the problem of capitalizing words (or not)
according to punctuation is taken care of by a completely different proc
that does auto correct according to another list. I actually use the
concept of auto correct to auto expand abbreviations and type faster.
That proc takes care of capitalization according to punctuation.
In a public application that would not be good enough, but since this
is for private use and is working as intended, I won't bother fixing
what ain't broken.
But another problem comes up.
In my current design, boxes with any problem cannot be approved and I am
not allowed to jump to the next one until the problem is properly fixed.
A "problem" currently means too many characters or an empty box. Empty
boxes may be desirable in certain circumstances so there is a "force"
command (and key shortcut) in case I want to override it. Misspellings
will just be a third kind of problem.
Workflow speed is always a priority with this thing so I implemented the
possibility of a double override action. The first override key press
will add all unknown words to the word list and the second override will
"approve and move forward."
But then I can't distinguish things from names. I can, but I guess I
would have to introduce a pop-up to decide which one every time. That
would slow things down. I thought that maybe it would be better to just
use one global word list and take care of casing with my own human
proofreading.
Then again, unknown words are highly likely to be proper names so I
decided to detect their case and send them straight to the names list
if they are written with a capital letter whether it's a name or not.
If they are not a name and happen to show up again in small letters,
then I will add them again, in which case they will go to the word list.
Now words or names are always added twice: to the list in memory and
appended to the file on disk.
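A sketch of that routing rule, under several assumptions: the proc name `addUnknownWord` and its arguments are invented, the in-memory lists are re-sorted after each addition so that `lsearch -sorted` keeps working, and the files are only appended to, so they would need sorting again at the next load. Re-sorting a list of over a million entries on every addition may be noticeable; that has not been measured here.

```tcl
# Add a word to the in-memory list and append it to the matching file.
# thingsVar and namesVar are the NAMES of the caller's list variables.
# Returns the path of the file that was appended to.
proc addUnknownWord {word thingsVar namesVar thingsFile namesFile} {
    upvar 1 $thingsVar things $namesVar names
    if {[string is upper -strict [string index $word 0]]} {
        # Starts with a capital: treat it as a name, name or not.
        lappend names $word
        set names [lsort $names]
        set path $namesFile
    } else {
        lappend things $word
        set things [lsort -nocase $things]
        set path $thingsFile
    }
    set fp [open $path a]
    puts $fp $word
    close $fp
    return $path
}
```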
On your 3 step approach:
What about words that can be both, like "Cat Stevens" and "I have a cat",
or "Drew" and "drew a picture"? Will you accept the user's case choice in
that situation?
So, two choices, use an array or use a list.