Sqwertz <[email protected]d> wrote:
I'm trying to filter out extraneous headers from this text file
which I've exported using File->Save Selected Messages->MBOX|TXT.
There are a couple thousand of messages in here and I'm trying to
make it more legible without all the visual noise of the headers.
No other long headers seem to do the CRLF thing except for
X-Received: Is this my obnoxious newserver (highwinds) doing this
and Dialog doesn't care?
I've been working on this for days off and on. Can anybody help me
delete all headers except for:
Newsgroups:
Date:
From:
Subject:
Message-ID:
(in their natural order, not how I've listed)
From the text file at:
https://drive.google.com/file/d/1ElDcN7rUvmy7kn6f3Sn78jz6YXwz-WhJ/view?usp=sharing
It's for a very good cause (the Missouri Board of Nursing in regards
to a paedo pediatric HOME CARE nurse)
Using Notepad++ I've gotten rid of everything EXCEPT for those nasty
X-Received: second lines and there's no pattern that won't remove
other context that I can figure - but my grepping and regexes are
really rusty.
Here's a sample of the text file to show my/our problem (more at the
link).
Thanks IA.
-sw
From [email protected] Mon Oct 04 05:37:30 2021
X-Folder: Kuthe
X-Received: by 2002:a37:688b:: with SMTP id d133mr9895201qkc.352.1633351051221;
Mon, 04 Oct 2021 05:37:31 -0700 (PDT)
X-Received: by 2002:a25:b84e:: with SMTP id b14mr15395553ybm.348.1633351051055;
Mon, 04 Oct 2021 05:37:31 -0700 (PDT)
Path: not-for-mail
Newsgroups: rec.food.cooking
Date: Mon, 4 Oct 2021 05:37:30 -0700 (PDT)
Injection-Info: google-groups.googlegroups.com; posting-host=35.129.9.50; posting-account=ja_j6woAAABJv24pt7Dxx6icnyi92ahF
NNTP-Posting-Host: 35.129.9.50
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <[email protected]>
Subject: And I had a GREAT ORGASM yesterday!
From: John Kuthe <[email protected]>
Injection-Date: Mon, 04 Oct 2021 12:37:31 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 1007
On Sunday, my DAY OFF! :-) Complete with ejaculation! Wow!
At 61 Years old! And it felt SO GOOD! :-)
John Kuthe, RN, BSN...
From [email protected] Sun Oct 03 18:34:10 2021
X-Folder: Kuthe
X-Received: by 2002:a0c:e381:: with SMTP id a1mr20159752qvl.42.1633311251669;
Sun, 03 Oct 2021 18:34:11 -0700 (PDT)
X-Received: by 2002:a25:3620:: with SMTP id d32mr12272072yba.46.1633311251515;
Sun, 03 Oct 2021 18:34:11 -0700 (PDT)
Path: not-for-mail
Newsgroups: rec.food.cooking
Date: Sun, 3 Oct 2021 18:34:11 -0700 (PDT)
In-Reply-To: <sjdl8i$9tg$[email protected]>
Injection-Info: google-groups.googlegroups.com; posting-host=35.129.9.50; posting-account=ja_j6woAAABJv24pt7Dxx6icnyi92ahF
NNTP-Posting-Host: 35.129.9.50
References: <[email protected]> <sjdl8i$9tg$[email protected]>
User-Agent: G2/1.0
Continuation lines are allowed for headers to accommodate those that are
long, sometimes exceeding the 998-character maximum per physical line.
headerName: string1
 string2
 string3
string2 and string3 are continuation lines.
Continuation lines are denoted by leading whitespace. That is, at a
minimum, there must be a space (or tab) character in column 1 of a
physical line for it to be a continuation of the preceding header line.
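As a rough illustration of that rule, here is a minimal Python sketch that joins continuation lines back onto the header line they belong to. The function name unfold_headers is made up for this example, and it assumes it is handed only the lines of one header section (nothing after the blank line).

```python
def unfold_headers(lines):
    """Join continuation lines onto the header line they continue.

    Assumes `lines` holds only the header section of one message.
    """
    unfolded = []
    for line in lines:
        line = line.rstrip("\r\n")
        # A space or tab in column 1 marks a continuation line.
        if line[:1] in (" ", "\t") and unfolded:
            # Append to the previous header, replacing the fold with one space.
            unfolded[-1] += " " + line.strip()
        else:
            unfolded.append(line)
    return unfolded
```

Fed the two physical lines of one of the X-Received headers in the sample, this should hand back a single long line, which a simple line-based filter can then match or drop.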
Nothing is wrong with the X-Received header. It obeys the RFC standard
for Internet messages. The header section ends with the first blank
line; i.e., a line with nothing before its line terminator (\n in
column 1). Before that, your script would need to append every
continuation line to the preceding line to compose 1 long header line as
1 physical line. If you're throwing away all the headers, why keep
anything before the blank line delimiting the header section? Scan
(parse through) each message, ignoring everything up to and including
the first blank line your parser encounters, and keep what follows.
If you want to keep some headers, you'll have to test each line on a
read to see if the header's name matches one of those you want to keep.
If so, you have to keep that line, and every continuation line
thereafter (every following line with a space in column 1), until the
next line in the format:
headerName: string
^          ^
|          |__ one whitespace minimum for parsing name from value
|__ must be in column 1 of a physical line
You'll need to write a parser script checking if each line is a header
line (headername:<space>), if that's one you want to keep, and if
following lines are continuation lines, or another header line, and
terminating the parsing upon reaching the first blank line.
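The parser described above could be sketched in Python roughly as follows. This is only an outline under some assumptions: the names KEEP and filter_headers are invented here, every message is assumed to start with an mbox "From " separator line (as in the sample), body lines that begin with "From " are assumed to have been escaped by the exporter, and the separator line itself is kept so the messages stay delimited.

```python
# Header names to keep, compared case-insensitively (taken from the request).
KEEP = {"newsgroups", "date", "from", "subject", "message-id"}


def filter_headers(lines, keep=KEEP):
    """Drop unwanted headers (and their continuation lines) from mbox text."""
    out = []
    in_headers = False  # True while inside a message's header section
    keeping = False     # True while the current header is one we keep
    for line in lines:
        text = line.rstrip("\r\n")
        if text.startswith("From "):
            # mbox separator: a new message and a new header section begin.
            in_headers = True
            keeping = False
            out.append(text)
        elif not in_headers:
            out.append(text)  # body line, copied unchanged
        elif text == "":
            in_headers = False  # first blank line ends the header section
            out.append(text)
        elif text[0] in " \t":
            # Continuation line: shares the fate of the header it continues.
            if keeping:
                out.append(text)
        else:
            name, sep, _value = text.partition(":")
            keeping = bool(sep) and name.strip().lower() in keep
            if keeping:
                out.append(text)
    return out
```

Something like filter_headers(open("export.txt", encoding="utf-8", errors="replace")) would then return the cleaned lines, where export.txt is a placeholder for the saved file. Passing keep=set() drops every header and leaves only the separators and bodies.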
Regex is handy, but I don't think you can get it to handle continuation
lines as part of the preceding header line.