Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program that'd
give me a list of all words occurring exactly once?
-- Also, a list of words occurring once, twice or 3 times
re: hyphenated words (you can treat it anyway you like)
but ideally, i'd treat [editor-in-chief]
[go-ahead] [pen-knife]
[know-how] [far-fetched] ...
as one unit.
On 31/05/24 08:03, HenHanna via Python-list wrote:
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program that'd
give me a list of all words occurring exactly once?
-- Also, a list of words occurring once, twice or 3 times
re: hyphenated words (you can treat it anyway you like)
but ideally, i'd treat [editor-in-chief]
[go-ahead] [pen-knife]
[know-how] [far-fetched] ...
as one unit.
Split into words - defined as you will.
Use Counter.
Show some (of your) code and we'll be happy to critique...
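As a starting point for that critique, here is a minimal sketch of the "split into words, use Counter" advice. The word definition is only an assumption (runs of letters, optionally joined by single hyphens, so that editor-in-chief stays one unit), and the names WORD_RE and count_words are made up for illustration.

```python
import re
from collections import Counter

# Assumed definition of a "word": one or more letters, optionally followed by
# further letter runs joined by single hyphens ("editor-in-chief" is one unit).
# [^\W\d_] matches any Unicode letter (a word character that is not a digit
# or an underscore).
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def count_words(text):
    """Return a Counter of the lowercased words found in text."""
    return Counter(WORD_RE.findall(text.lower()))

sample = "The editor-in-chief gave the go-ahead; the pen-knife was far-fetched."
counts = count_words(sample)
print(counts["editor-in-chief"], counts["the"])  # 1 3
```

A different definition of "word" only means swapping the regex; the counting step stays the same.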
hard to decide what to do with hyphens
and apostrophes
(I'd, he's, can't, haven't, A's and B's)
On 2024-05-30 19:26:37 -0700, HenHanna via Python-list wrote:
hard to decide what to do with hyphens
and apostrophes
(I'd, he's, can't, haven't, A's and B's)
Especially since the same character is used as both an apostrophe and a closing quotation mark. And while that's pretty unambiguous between two characters, it isn't at the end of a word:
This is Alex’ house.
This type of building is called an ‘Alex’ house.
The sentence ‘We are meeting at Alex’ house’ contains an apostrophe.
(Using proper Unicode quotation marks. It gets worse if you stick to
ASCII.)
Personally I like to use U+0027 APOSTROPHE as an apostrophe and U+2018
LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK as
single quotation marks[1], but despite the suggestive names, this is not
the common typographical convention, so your texts are unlikely to make
this distinction.
hp
[1] Which I use rarely, anyway.
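To make the ambiguity concrete, here is a rough sketch of one possible rule: treat U+0027 or U+2019 as part of a word only when it sits between two letters. The rule and the name words_with_apostrophes are assumptions for illustration, not a recommendation. Note that it gets "can’t" right but silently drops the word-final apostrophe in "Alex’ house", which is exactly the case that cannot be decided from the characters alone.

```python
import re

# Hypothetical rule: an apostrophe-like character (U+0027 or U+2019) belongs
# to a word only when it has a letter on both sides.
WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")

def words_with_apostrophes(text):
    """Extract words, normalising U+2019 to U+0027 inside them."""
    return [w.replace("\u2019", "'") for w in WORD_RE.findall(text)]

sample = "I can\u2019t find \u2018Alex\u2019 or Alex\u2019 house."
print(words_with_apostrophes(sample))
# ['I', "can't", 'find', 'Alex', 'or', 'Alex', 'house']
```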
HenHanna wrote at 2024-5-30 13:03 -0700:
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program that'd
give me a list of all words occurring exactly once?
Your task can be split into several subtasks:
* parse the text into words
This depends on your notion of "word".
In the simplest case, a word is any maximal sequence of non-whitespace
characters. In this case, you can use `split` for this task
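A tiny sketch of that simplest case, using a made-up sample string rather than the real file. It also shows the catch with this definition: punctuation stays attached to the word.

```python
# Sample text invented for illustration; the real input would be the novel.
text = "Stately, plump Buck Mulligan came from the stairhead. Stately!"

# str.split() with no arguments splits on any run of whitespace.
words = text.split()
print(words)
# ['Stately,', 'plump', 'Buck', 'Mulligan', 'came', 'from', 'the',
#  'stairhead.', 'Stately!']
```

Here "Stately," and "Stately!" would be counted as two different words.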
On 5/31/24 11:59, Dieter Maurer via Python-list wrote:
hmmm, I "sent" this but there was some problem and it remained
unsent. Just in case it hasn't All Been Said Already, here's the
retry:
HenHanna wrote at 2024-5-30 13:03 -0700:
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program
that'd give me a list of all words occurring exactly once?
Your task can be split into several subtasks:
* parse the text into words
This depends on your notion of "word".
In the simplest case, a word is any maximal sequence of
non-whitespace characters. In this case, you can use `split` for
this task
This piece is by far "the hard part", because of the ambiguity. For
example, if I just say non-whitespace, then words followed by
punctuation count as distinct words. What about hyphenation - of which there's
both the compound word forms and the ones at the end of lines if the
source text has been formatted that way. Are all-lowercase words
different than the same word starting with a capital? What about
non-initial capitals, as happens a fair bit in modern usage with
acronyms, trademarks (perhaps not in Ulysses? :-) ), etc. What about
accented letters?
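Each of those questions becomes an explicit choice in code. The sketch below shows one possible set of answers, all of them assumptions: rejoin words hyphenated across a line break, compose accented letters, and fold case. The function name normalise is made up for illustration.

```python
import re
import unicodedata

def normalise(text):
    """Apply one possible set of normalisation choices before counting."""
    # Rejoin words hyphenated across a line break: "stair-\nhead" -> "stairhead".
    # Caveat: this also fuses a genuine compound that happens to break at the
    # hyphen ("pen-\nknife" -> "penknife").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Compose accented letters so "e" + combining acute equals a single "é".
    text = unicodedata.normalize("NFC", text)
    # Fold case so "Stately" and "stately" are counted together.
    return text.casefold()

print(normalise("Stair-\nhead"))  # stairhead
```

Folding case also merges proper nouns with ordinary words, so whether it is wanted depends on the definition of "word" being used.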
If you want what's at least a quick starting point to play with, you
could use a very simple regex - a fair amount of thought has gone
into what a "word character" is (\w), so it deals with excluding both punctuation and whitespace.
import re
from collections import Counter

# Lowercase the whole text, then count every run of "word characters".
with open("JoyceUlysses.txt", "r", encoding="utf-8") as f:
    wordcount = Counter(re.findall(r'\w+', f.read().lower()))
Now you have a Counter object counting all the "words" with their
occurrence counts (by this definition) in the document. You can fish
through that to answer the questions asked (find entries with a count
of 1, 2, 3, etc.)
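The "fishing" can be as short as a filtered comprehension. A minimal sketch, using a small stand-in Counter instead of the real file; the helper name words_with_count is made up for illustration.

```python
from collections import Counter

def words_with_count(wordcount, low, high=None):
    """Return the sorted words whose count n satisfies low <= n <= high."""
    high = low if high is None else high
    return sorted(w for w, n in wordcount.items() if low <= n <= high)

# Stand-in data: a occurs once, b twice, c three times, d four times, e once.
wordcount = Counter("a b b c c c d d d d e".split())

once = words_with_count(wordcount, 1)            # exactly once
up_to_three = words_with_count(wordcount, 1, 3)  # once, twice or 3 times
print(once)         # ['a', 'e']
print(up_to_three)  # ['a', 'b', 'c', 'e']
```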
Some people Go Big and use something that actually tries to recognize
the language, as opposed to making assumptions from ranges of
characters. nltk is a choice there. But at this point it's not
really "simple" any longer (though nltk experts might end up
disagreeing with that).
On 2024-06-03, Edward Teach via Python-list <[email protected]>
wrote:
The Gutenberg Project publishes "plain text". That's another
problem, because "plain text" means UTF-8....and that means
unicode...and that means running some sort of unicode-to-ascii
conversion in order to get something like "words". A couple of
hours....a couple of hundred lines of C....problem solved!
I'm curious. Why does it need to be converted from Unicode to ASCII?
When you read it into Python, it gets converted right back to
Unicode...
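A small sketch of why no conversion should be needed, assuming Python 3 and a UTF-8 file: str is already Unicode, and \w in the re module matches non-ASCII letters by default. The sample string is invented, and read_text is a hypothetical helper that is defined here but not called.

```python
import re

def read_text(path):
    """Read a UTF-8 text file into a str (hypothetical helper, not called)."""
    with open(path, encoding="utf-8") as f:
        return f.read()

# For str patterns, \w matches Unicode letters unless re.ASCII is given.
text = "naïve café Ulysses"
words = re.findall(r"\w+", text)
print(words)  # ['naïve', 'café', 'Ulysses']
```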
...
On 5/30/2024 2:18 PM, dn wrote:
On 31/05/24 08:03, HenHanna via Python-list wrote:
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program
that'd give me a list of all words occurring exactly once?
-- Also, a list of words occurring once, twice or 3 times
re: hyphenated words (you can treat it anyway you like)
but ideally, i'd treat [editor-in-chief]
[go-ahead] [pen-knife]
[know-how] [far-fetched] ...
as one unit.
Split into words - defined as you will.
Use Counter.
Show some (of your) code and we'll be happy to critique...
hard to decide what to do with hyphens
and apostrophes
(I'd, he's, can't, haven't, A's and B's)
2-step-Process
1. make a file listing all words (one word per line)
2. then, doing the counting. using
from collections import Counter
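A rough sketch of that two-step process, under a few assumptions: the word regex (word characters, optionally joined by hyphens), the intermediate file name words.txt, and the helper names are all made up for illustration. A temporary directory stands in for wherever the word list would really be written.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

def write_word_list(text, path):
    """Step 1: write one lowercased word per line."""
    words = re.findall(r"\w+(?:-\w+)*", text.lower())
    path.write_text("\n".join(words) + "\n", encoding="utf-8")

def count_word_list(path):
    """Step 2: count the non-empty lines of the word-list file."""
    with path.open(encoding="utf-8") as f:
        return Counter(line.strip() for line in f if line.strip())

with tempfile.TemporaryDirectory() as tmp:
    word_file = Path(tmp) / "words.txt"  # hypothetical intermediate file
    write_word_list("Know-how is know-how; a pen-knife is not.", word_file)
    counts = count_word_list(word_file)

print(counts["know-how"], counts["pen-knife"])  # 2 1
```

The intermediate file is handy for eyeballing what the tokenizer produced before trusting the counts.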
On 31/05/24 14:26, HenHanna via Python-list wrote:
Apologies for lateness - only just able to come back to this.
This issue is not Python, and is not solved by code!
If you/your teacher can't define a "word", the code, any code, will almost-certainly be wrong!
One of the interesting aspects of our work is that we can write all
manner of tests to try to ensure that the code is correct: unit tests,
integration tests, system tests, acceptance tests, eye-tests, ...
However, there is no such thing as a test (or proof) that statements of
requirements are complete or correct!
(nor for any other previous stages of the full project life-cycle)
As coders we need to learn to require clear specifications and not
attempt to read-between-the-lines, use our initiative, or otherwise
'not bother the ...'. When there is ambiguity, we should go back to the
user/client/boss and seek clarification. They are the
domain/subject-matter experts...
I'm reminded of a cartoon, possibly from some IBM source, first seen in black-and-white but here in living-color: https://www.monolithic.org/blogs/presidents-sphere/what-the-customer-really-wants
That has been the sad history of programming and dev.projects - wherein
we are blamed for every short-coming, because no-one else understands
the nuances of development projects.
If we don't insist on clarity, are we our own worst enemy?
People need to see something to help them know what they really want.
Of course, we see this lack of clarity all the time in questions to the list. I often wonder how these askers can possibly come up with
acceptable code if they don't realize they don't truly know what it's supposed to do.
On Sat, Jun 8, 2024 at 10:39 AM Mats Wichmann via Python-list < [email protected]> wrote:
On 6/5/24 05:10, Thomas Passin via Python-list wrote:
Of course, we see this lack of clarity all the time in questions to the
list. I often wonder how these askers can possibly come up with
acceptable code if they don't realize they don't truly know what it's
supposed to do.
Fortunately, having to explain to someone else why something is giving
you trouble can help shed light on the fact the problem statement isn't
clear, or isn't clearly understood. Sometimes (sadly, many times it
doesn't).
The original question struck me as homework or an interview question for a junior position. But having no clear requirements or specifications is good training for the real world where that is often the case. When you question that, you are told to just do something, and then you’re told it’s not what
is wanted. That frustrates people but it’s often part of the process. People need to see something to help them know what they really want.