• Re: Unable to wget some pages

    From Dan Purgert@21:1/5 to Michael F. Stemper on Mon Mar 11 14:27:09 2024
    On 2024-03-11, Michael F. Stemper wrote:
    Late last week, a script that I have used for several years suddenly
    stopped working. Investigation showed that wget was failing to
    download some pages. A simplified version, showing the problem, is:

    $ cat ic
    uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
    wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
    $ . ./ic
    --2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
    Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
    Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
    HTTP request sent, awaiting response... 401 HTTP Forbidden

    Username/Password Authentication Failed.
    $

    Looking at the error message, one might think that this page/site
    requires user login credentials. However, the same URL works just
    fine in Firefox, with no login requested or required.

    Looks like the page *does* have a login button / javascript thing
    "somewhere" (at least I can see it when I open the page in lynx here).
    I'd imagine either

    (1) wget is respecting some robots.txt somewhere OR
    (2) wget is following that login link for some reason

    The "401" is the error code. The "HTTP Forbidden" is (for lack of a
    better word) "custom text" they're supplying. I've done similar where a
    HTTP upload process sends back "200 OK, Got it!" as a proof-of-sanity
    when scripting things with expect.
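
    To illustrate the split between the numeric code and the free-form text, here is a minimal sketch that takes a status line apart with plain parameter expansion. The "HTTP/1.1" prefix is an assumption for illustration; wget's output above only shows the "401 HTTP Forbidden" part.

```shell
# Status line roughly as a server might send it. The "HTTP/1.1" version
# prefix is assumed; only "401 HTTP Forbidden" appears in the wget output.
status_line='HTTP/1.1 401 HTTP Forbidden'

# Drop the protocol version (everything up to the first space).
rest=${status_line#* }

# The numeric status code is the next word; this is what clients act on.
code=${rest%% *}

# Whatever follows is the reason phrase, which the server may word freely.
reason=${rest#* }

printf 'code:   %s\n' "$code"
printf 'reason: %s\n' "$reason"
```

    Only the code (401 here) carries defined meaning; the reason text is whatever the site chose to send.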

    --
    |_|O|_|
    |_|_|O| Github: https://github.com/dpurgert
    |O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Josef Möllers@21:1/5 to Dan Purgert on Mon Mar 11 17:08:59 2024
    On 11.03.24 15:27, Dan Purgert wrote:
    On 2024-03-11, Michael F. Stemper wrote:
    Late last week, a script that I have used for several years suddenly
    stopped working. Investigation showed that wget was failing to
    download some pages. A simplified version, showing the problem, is:

    $ cat ic
    uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
    wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
    $ . ./ic
    --2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
    Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
    Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
    HTTP request sent, awaiting response... 401 HTTP Forbidden

    Username/Password Authentication Failed.
    $

    Looking at the error message, one might think that this page/site
    requires user login credentials. However, the same URL works just
    fine in Firefox, with no login requested or required.

    Looks like the page *does* have a login button / javascript thing
    "somewhere" (at least I can see it when I open the page in lynx here).
    I'd imagine either

    (1) wget is respecting some robots.txt somewhere OR
    (2) wget is following that login link for some reason

    The "401" is the error code. The "HTTP Forbidden" is (for lack of a
    better word) "custom text" they're supplying. I've done similar where a
    HTTP upload process sends back "200 OK, Got it!" as a proof-of-sanity
    when scripting things with expect.

    Besides that ... is it on purpose that $uas is between single quotes, so
    it won't get expanded? Double quotes are required because the user agent
    string has blanks (and parentheses), but single quotes are definitely
    wrong here!
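
    A minimal sketch of the quoting difference, using the user agent string from the original script. The variable names single and double are only for illustration, and the corrected wget call is left as a comment because it needs network access. Whether fixing the quoting alone cures the 401 is not established in this thread.

```shell
# The user agent string from the original script.
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"

# Single quotes: no expansion, so wget would receive the literal text $uas.
single='$uas'

# Double quotes: the variable is expanded and remains one single argument,
# despite the blanks and parentheses inside it.
double="$uas"

printf 'single quotes give: %s\n' "$single"
printf 'double quotes give: %s\n' "$double"

# Corrected invocation (commented out, it needs network access):
# wget "https://www.marketwatch.com/investing/index/spx" -U "$uas" -O SP500
```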

    Josef "2cts" Möllers

  • From Paul@21:1/5 to Michael F. Stemper on Mon Mar 11 12:46:50 2024
    On 3/11/2024 10:11 AM, Michael F. Stemper wrote:
    Late last week, a script that I have used for several years suddenly
    stopped working. Investigation showed that wget was failing to
    download some pages. A simplified version, showing the problem, is:

    $ cat ic
    uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
    wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
    $ . ./ic
    --2024-03-11 08:52:23--  https://www.marketwatch.com/investing/index/spx
    Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
    Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
    HTTP request sent, awaiting response... 401 HTTP Forbidden

    Username/Password Authentication Failed.
    $

    Looking at the error message, one might think that this page/site
    requires user login credentials. However, the same URL works just
    fine in Firefox, with no login requested or required.

    Despite this, I tried telling wget to provide empty username and
    password, with no observable change in results.

    On a purely cargo-cult basis, I tried some different user agent
    strings, with no effect.

    I searched on "401 HTTP Forbidden", only to find that there does
    not appear to be such an error. There is "401 Unauthorized", and
    "403 Forbidden", but no such cross-breed.

    I looked briefly at the page source (in Firefox), but without a
    top-level design document, couldn't make head or tail of it.

    Does anybody have any suggestions on how to fix my problem and
    again automatically download this, and neighboring, pages?

    It's almost as if https:// and http:// are getting mixed up at some
    point in the operation, with the website denying http:// access.

    Maybe at some point, the website used to redirect the http://
    attempt to https:// for you, and maybe it's not doing that
    any more?

    Or perhaps wget has developed a defect in dining habits
    related to that aspect.

    Paul

  • From Dan Purgert@21:1/5 to Michael F. Stemper on Tue Mar 12 09:24:24 2024
    On 2024-03-11, Michael F. Stemper wrote:
    On 11/03/2024 09.27, Dan Purgert wrote:
    On 2024-03-11, Michael F. Stemper wrote:
    Late last week, a script that I have used for several years suddenly
    stopped working. Investigation showed that wget was failing to
    download some pages. A simplified version, showing the problem, is:

    $ cat ic
    uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
    wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
    $ . ./ic
    --2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
    Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
    Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
    HTTP request sent, awaiting response... 401 HTTP Forbidden

    Username/Password Authentication Failed.
    $

    Looking at the error message, one might think that this page/site
    requires user login credentials. However, the same URL works just
    fine in Firefox, with no login requested or required.

    Looks like the page *does* have a login button / javascript thing
    "somewhere" (at least I can see it when I open the page in lynx here).

    I've never installed lynx. Is it capable of running as a background
    process, e.g., via crontab?

    Not that I'm aware of, sorry.


    I'd imagine either

    (1) wget is respecting some robots.txt somewhere OR
    (2) wget is following that login link for some reason

    Any ideas how I could test for, or prevent, either of these?

    Potentially adding "-e robots=off" will avoid #1. More verbosity (-v) or turning on headers (-S?) may help for both as well.

    But both of these were a bit of a stab in the dark.
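
    For anyone wanting to try those options, a rough sketch of the combined command line follows. It only assembles and prints the command rather than running it, since the fetch needs network access. As far as I know, wget consults robots.txt only during recursive retrieval, so "-e robots=off" may make no difference for a single-page fetch like this one.

```shell
url="https://www.marketwatch.com/investing/index/spx"

# -e robots=off : tell wget to ignore robots.txt
# -S            : print the response headers the server sends back
# -v            : verbose output (already wget's default, shown for clarity)
set -- wget -e robots=off -S -v "$url" -O SP500

# Print the assembled command instead of executing it.
printf '%s ' "$@"
printf '\n'

# To actually run it, uncomment the next line:
# "$@"
```

    The headers printed by -S should show whether the 401 comes back on the first request or only after a redirect.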


    --
    |_|O|_|
    |_|_|O| Github: https://github.com/dpurgert
    |O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
