XPost: alt.html
In comp.infosystems.www.misc, Ivan Shmakov <[email protected]> wrote:
[email protected] writes:
[Cross-posting to news:comp.infosystems.www.misc as I feel that
this question has more to do with Web than HTML per se.]
:^)
I have a website organized as a large number (> 200,000) of pages.
It is hosted by a large Internet hosting company.
...
My users may click to 10 or 20 pages in a session. But the indexing
bots want to read all 200,000+ pages! My host has now complained
that the site is under "bot attack" and has asked me to check my own laptop for viruses!
200k pages isn't that huge, and if they are static files on disk, as
described in a snipped-out part, they shouldn't be that hard to serve.
Bandwidth may be an
issue, depending on how you are being charged. And on a shared system,
which I think you might have, your options for optimizing for massive
amounts of static files might be limited.
I'm happy anyway to reduce the bot activity. I don't mind having my
site indexed, but once or twice a year would be enough!
Some of the better search engines will gladly consult site map files
that give hints about what needs reindexing. See:
https://www.sitemaps.org/protocol.html
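As a rough illustration of what such a sitemap file contains, here is a minimal Python sketch that builds one. The example.com URLs, the dates and the build_sitemap helper are placeholders of mine, not anything from the original site. Note that changefreq is only a hint to crawlers, and that the protocol caps a single sitemap file at 50,000 URLs, so 200,000+ pages would need several files plus a sitemap index.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML (as a str) for (url, lastmod, changefreq) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        # A hint only: crawlers are free to ignore it.
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages; example.com is a placeholder, not the poster's site.
pages = [
    ("https://www.example.com/page-000001.html", "2017-01-15", "yearly"),
    ("https://www.example.com/page-000002.html", "2017-01-15", "yearly"),
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```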
I see that there is a way to stop the Google Bot specifically. I'd
love it if I could do the opposite -- have *only* Google index my
site.
JFTR, I personally (as well as many other users who value their
privacy) refrain from using Google Search and rely on, say,
https://duckduckgo.com/ instead.
Yeah, Google only is an "all your eggs in one basket" route. I, too,
have been using DDG almost exclusively for several years.
A technician at the hosting company wrote to me
As per the above logs and hitting IP addresses, we have blocked the
46.229.168.* IP range to prevent the further abuse and advice you to
also check incoming traffic and block such IP's in future.
46.229.168.0-46.229.168.255 is:
netname: ADVANCEDHOSTERS-NET
Can't say I've heard of them.
We have also blocked the bots by adding the following entry
in robots.txt:-
User-agent: AhrefsBot
Yes, block them. Not a search engine, but a commercial SEO service.
https://ahrefs.com/robot
User-agent: MJ12bot
Eh, maybe block, maybe not. Seems to be a real search engine.
http://mj12bot.com/
User-agent: SemrushBot
Yes, block them. Not a search engine, but a commercial SEO service.
https://www.semrush.com/bot/
User-agent: YandexBot
Real Russian search engine.
https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml
User-agent: Linguee Bot
Real service, but dubious value to a webmaster.
http://www.botreports.com/user-agent/linguee-bot.shtml
All bots can be impersonated by other bots, so you can't be sure the
User-Agent: will be the real identity of the bots. You can spend a lot
of time researching bots and the characteristics of real bot usage,
e.g. hostnames or IP address ranges of legit bot servers.
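One common way to do the hostname check is a forward-confirmed reverse DNS lookup: resolve the client IP to a hostname, check that the hostname falls under a domain the bot operator publishes, then resolve that hostname back and confirm it returns the same IP. A minimal Python sketch follows. The is_verified_bot helper, the ".bot.example" suffix and the 192.0.2.x addresses are all made up for illustration; the real suffixes have to come from each operator's own documentation. The demo at the bottom uses fake resolvers so it runs without touching the network.

```python
import socket


def is_verified_bot(ip, allowed_suffixes,
                    reverse=socket.gethostbyaddr,
                    forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        hostname = reverse(ip)[0]           # IP -> hostname (PTR lookup)
    except OSError:
        return False
    if not hostname.lower().endswith(tuple(allowed_suffixes)):
        return False
    try:
        addresses = forward(hostname)[2]    # hostname -> list of IPs
    except OSError:
        return False
    return ip in addresses


# Fake resolvers standing in for real DNS, for demonstration only.
def fake_reverse(ip):
    return ("crawl-1.bot.example", [], [ip])


def fake_forward(hostname):
    return (hostname, [], ["192.0.2.10"])


genuine = is_verified_bot("192.0.2.10", [".bot.example"],
                          fake_reverse, fake_forward)
spoofed = is_verified_bot("192.0.2.11", [".bot.example"],
                          fake_reverse, fake_forward)
print(genuine, spoofed)
```

Calling is_verified_bot(ip, suffixes) without the last two arguments uses the real resolver. Lookups like that are slow, so in practice you would cache the results rather than check on every request.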
Given the little I've seen here, I wonder if you have someone at
Advanced Hosters impersonating bots to suck your site down.
As long as the troublesome bots honor robots.txt (there're those
that do not; but then, the above won't work on them, either),
a more sane solution would be to limit the /rate/ the bots
request your pages for indexing, like:
### robots.txt
### Data:
## Request that the bots wait at least 3 seconds between requests.
User-agent: *
Crawl-delay: 3
### robots.txt ends here
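To see how a compliant crawler would read that rule, here is a small sketch using Python's standard urllib.robotparser. "ExampleBot" is a made-up name, and this only shows what one particular parser makes of the file; it says nothing about whether a given real bot honors Crawl-delay.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
## Request that the bots wait at least 3 seconds between requests.
User-agent: *
Crawl-delay: 3
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is hypothetical; the wildcard entry applies to any name.
delay = parser.crawl_delay("ExampleBot")
allowed = parser.can_fetch("ExampleBot", "/any/page.html")
print(delay, allowed)
```

Since there is no Disallow line, every page remains fetchable; only the requested pacing changes.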
Except for Linguee, I think all of the bots listed above are
well-behaved and will obey robots.txt, but I don't know if they are all
advanced enough to understand Crawl-delay. Some of them explicitly state
they do, however.
This way, the bots will still scan all your 2e5 pages, but their
accesses will be spread over about a week -- which (I hope)
will be well within "acceptable use limits" of your hosting
company.
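The "about a week" figure is just pages times delay. A quick Python check, assuming one bot that waits exactly the requested 3 seconds and fetches nothing twice (crawl_duration is my own helper name):

```python
from datetime import timedelta


def crawl_duration(pages, delay_seconds):
    """Minimum time for one bot to fetch every page once at the given delay."""
    return timedelta(seconds=pages * delay_seconds)


full_crawl = crawl_duration(200_000, 3)
print(full_crawl)                          # 6 days, 22:40:00
print(full_crawl.total_seconds() / 86400)  # length in days, as a float
```

That is per bot, so several bots crawling at once still add up, but each one is held to at most one request every 3 seconds.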
The only bot I've ever had to blacklist was an MSN bot that absolutely
refused to stop hitting one page over and over again a few years ago. I
used a server directive to shunt that one bot to 403 Forbidden errors.
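For anyone without access to the server configuration, the same idea can be sketched at the application level. What follows is a Python WSGI stand-in, not the server directive I actually used. The "msnbot" substring, the block_user_agents wrapper and the placeholder site app are assumptions for illustration only.

```python
def block_user_agents(app, blocked_substrings):
    """Wrap a WSGI app so matching User-Agent headers get 403 Forbidden."""
    blocked = [s.lower() for s in blocked_substrings]

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(s in agent for s in blocked):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)

    return middleware


def site(environ, start_response):
    # Placeholder application standing in for the real site.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"page content\n"]


# "msnbot" is an assumed User-Agent substring, used here for illustration.
app = block_user_agents(site, ["msnbot"])


def status_for(agent):
    """Call the wrapped app with a fake request and return the status line."""
    seen = []
    app({"HTTP_USER_AGENT": agent}, lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("msnbot/2.0b"))
print(status_for("Mozilla/5.0"))
```

As noted above, the User-Agent can be faked, so this only stops bots that identify themselves honestly.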
Elijah
------
stopped worrying about bots a long time ago