Late last week, a script that I have used for several years suddenly
stopped working. Investigation showed that wget was failing to
download some pages. A simplified version, showing the problem, is:
$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
$ . ./ic
--2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
HTTP request sent, awaiting response... 401 HTTP Forbidden
Username/Password Authentication Failed.
$
Looking at the error message, one might think that this page/site
requires user login credentials. However, the same URL works just
fine in Firefox, with no login requested or required.
On 2024-03-11, Michael F. Stemper wrote:
> Late last week, a script that I have used for several years suddenly
> stopped working. Investigation showed that wget was failing to
> download some pages. A simplified version, showing the problem, is:
> $ cat ic
> uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
> wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
> $ . ./ic
> --2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
> Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
> Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
> HTTP request sent, awaiting response... 401 HTTP Forbidden
> Username/Password Authentication Failed.
> $
> Looking at the error message, one might think that this page/site
> requires user login credentials. However, the same URL works just
> fine in Firefox, with no login requested or required.
Looks like the page *does* have a login button / javascript thing
"somewhere" (at least I can see it when I open the page in lynx here).
I'd imagine either
(1) wget is respecting some robots.txt somewhere OR
(2) wget is following that login link for some reason
The "401" is the error code. The "HTTP Forbidden" is (for lack of a
better word) "custom text" they're supplying. I've done similar where a
HTTP upload process sends back "200 OK, Got it!" as a proof-of-sanity
when scripting things with expect.
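The split described above, a numeric status code followed by free-form text, can be illustrated by taking a status line apart in the shell. This is only a sketch: the status line is hard-coded from the wget output quoted in this thread, and the `HTTP/1.1` prefix is an assumption, since wget does not show the protocol version on that line.

```shell
# Status line as a server might send it. The "HTTP/1.1" prefix is assumed;
# "401 HTTP Forbidden" is copied from the wget output in this thread.
status_line='HTTP/1.1 401 HTTP Forbidden'

# Strip the protocol version, leaving "401 HTTP Forbidden".
rest=${status_line#* }

# The first word is the status code; everything after it is the reason phrase.
code=${rest%% *}
reason=${rest#* }

printf 'code=%s reason=%s\n' "$code" "$reason"
```

Only the number carries a standardized meaning (401 is registered as "Unauthorized"); the text after it is informational and a server can put anything there.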
Almost like there is a mixing up at some point, of https://
Late last week, a script that I have used for several years suddenly
stopped working. Investigation showed that wget was failing to
download some pages. A simplified version, showing the problem, is:
$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
$ . ./ic
--2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
HTTP request sent, awaiting response... 401 HTTP Forbidden
Username/Password Authentication Failed.
$
Looking at the error message, one might think that this page/site
requires user login credentials. However, the same URL works just
fine in Firefox, with no login requested or required.
Despite this, I tried telling wget to provide empty username and
password, with no observable change in results.
On a purely cargo-cult basis, I tried some different user agent
strings, with no effect.
I searched on "401 HTTP Forbidden", only to find that there does
not appear to be such an error. There is "401 Unauthorized", and
"403 Forbidden", but no such cross-breed.
I looked briefly at the page source (in Firefox), but without a
top-level design document, couldn't make head or tail of it.
Does anybody have any suggestions on how to fix my problem and
again automatically download this, and neighboring, pages?
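One detail in the script as posted may be worth ruling out first: `$uas` is passed to `-U` inside single quotes, and the shell does not expand variables inside single quotes, so wget would be sending the literal text `$uas` as its user agent. That might also explain why trying different user agent strings made no visible difference, though nothing in this thread establishes that this is what the server objects to. A minimal sketch of the quoting difference, using a made-up agent string:

```shell
# Hypothetical user agent string, for illustration only.
uas="ExampleBrowser/1.0 (illustration only)"

single='$uas'   # single quotes: no expansion, the text stays as $uas
double="$uas"   # double quotes: the variable's value is substituted

printf 'single-quoted: %s\n' "$single"
printf 'double-quoted: %s\n' "$double"
```

In the script, the corresponding change would be to write `-U "$uas"`.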
On 11/03/2024 09.27, Dan Purgert wrote:
> On 2024-03-11, Michael F. Stemper wrote:
>> Late last week, a script that I have used for several years suddenly
>> stopped working. Investigation showed that wget was failing to
>> download some pages. A simplified version, showing the problem, is:
>> $ cat ic
>> uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
>> wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
>> $ . ./ic
>> --2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
>> Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
>> Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
>> HTTP request sent, awaiting response... 401 HTTP Forbidden
>> Username/Password Authentication Failed.
>> $
>> Looking at the error message, one might think that this page/site
>> requires user login credentials. However, the same URL works just
>> fine in Firefox, with no login requested or required.
> Looks like the page *does* have a login button / javascript thing
> "somewhere" (at least I can see it when I open the page in lynx here).
I've never installed lynx. Is it capable of running as a background
process, e.g., via crontab?
> I'd imagine either
> (1) wget is respecting some robots.txt somewhere OR
> (2) wget is following that login link for some reason
Any ideas how I could test for, or prevent, either of these?
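One way to probe both guesses is to ask wget to show the server's response headers, to refuse redirects, and to ignore robots.txt. The sketch below only prints the command by default; the `RUN` variable is a hypothetical switch added for illustration, and the URL, agent string, and output file name are taken from the script in this thread. As far as I know, wget consults robots.txt only during recursive retrieval, so guess (1) may not apply to a single-page fetch; treat that as an assumption to check against the wget manual.

```shell
url="https://www.marketwatch.com/investing/index/spx"
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"

# Diagnostic options:
#   -S                print the server's response headers
#   --max-redirect=0  do not follow redirects (rules out a hop to a login page)
#   -e robots=off     ignore robots.txt
# Note the double quotes around $uas, so that the variable is expanded.
set -- -S --max-redirect=0 -e robots=off -U "$uas" -O SP500 "$url"

# Show what would be run; set RUN=1 in the environment to actually invoke wget.
shown="wget$(printf ' %s' "$@")"
printf '%s\n' "$shown"
if [ "${RUN:-0}" = 1 ]; then
    wget "$@"
fi
```

If the 401 still appears with redirects disabled, the refusal is coming from the first request itself rather than from a followed login link, and the headers printed by `-S` may hint at what the server expects.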