Yahoo is raiding my forum!

Smith6612

I ate all of the x10Pizza
Community Support
Messages
6,517
Reaction score
48
Points
48
Look at what Yahoo is doing to my forum. I think I need to change my robots.txt file again.
 

Attachment: wtf.png

dWhite

Guest
That's normal. Search engines have the ability to send out hundreds upon hundreds of spiders at the same time.
 

Smith6612

I know it's normal :p My home server still gets raided by search engines all the time, which is like once a week.
 

rockee

New Member
Messages
120
Reaction score
0
Points
0
As you know, bandwidth is a precious commodity and not an infinite resource on free X10 Hosting.

So if you want to conserve some of that bandwidth by limiting your visitors to those humans that read and contribute to your forum, then you have 2 very strong tools to achieve this goal.

One you have mentioned is the robots.txt file, but it is not always adhered to. What I call rude bots fetch robots.txt, for the sake of appearances in your logs, and then carry on spidering your web site regardless of any restrictions they find in it. I can assure you there are lots of those little beasties out there.

There are many examples out there of how to configure a robots.txt file to Disallow individual robots from accessing your site, to keep robots out of individual files and folders, and to set the frequency of visits and a time gap between GET requests so they don't hog the connections your real visitors need. If you need help with the robots.txt file, watch out for my next tutorial in the Tutorials Forum or do a Google search for robots.txt.
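
As a rough illustration of those options, here is a minimal robots.txt sketch. ExampleBot and the /admin/ and /search.php paths are made-up placeholders, and Crawl-delay is a non-standard extension: Yahoo's Slurp honours it, but Google ignores it.

Code:
# Ask Yahoo's crawler to wait 10 seconds between requests
User-agent: Slurp
Crawl-delay: 10

# Shut one named robot out of the whole site (ExampleBot is hypothetical)
User-agent: ExampleBot
Disallow: /

# All other robots: stay out of these example paths
User-agent: *
Disallow: /admin/
Disallow: /search.php

The file goes in the web root, and only well-behaved robots will pay any attention to it.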

So to really put the mockers on those, or on any non-human robot that spiders your site, you can use mod_rewrite directives in a .htaccess file in the web root of your site (public_html).

Here's how.
I have included these entries from my own .htaccess, which has been very successful in keeping my log files and my bandwidth under my control.

The list is quite comprehensive, as it has been built up over the years. As such, some of the bots may have ceased to exist, and the new faces on the block have yet to be added; they will be if and when I see them in my site's log files.

You can pick and choose, adding those that you feel ignore your robots.txt file and those that don't even bother with the file at all, which are mostly the spam bots looking for email addresses to add to a spammer's joy.

I have used this list without issue for many years on just about every hosting service I have used (and, incidentally, owned), with most of the entries coming from the log files of my own dedicated servers.

So if you get errors after adding to or editing your .htaccess file, check that you have not made a typo or a copy-and-paste error.

Make a backup of any existing .htaccess file before adding this list or parts of this list.

I can assure you also that it will not affect the human visitors you wish to have access to your site. Even they can be denied access, if they play up, in the same .htaccess file by using a different directive.
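
One way to do that, as a minimal sketch rather than the only option, is Apache's host-based access control. The addresses below are placeholders from the reserved documentation ranges, and the syntax assumes Apache 2.2 and a host that permits these directives in .htaccess; on Apache 2.4 the equivalent is Require not ip inside a <RequireAll> block.

Code:
# Let everyone in by default, then turn away specific addresses
Order Allow,Deny
Allow from all
# A single troublesome visitor (placeholder address)
Deny from 192.0.2.15
# A whole range, in CIDR notation (placeholder range)
Deny from 198.51.100.0/24

With Order Allow,Deny, a request that matches both an Allow and a Deny line is refused, so the Deny entries win.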

Code:
RewriteEngine on 
RewriteBase /
# User-Agents with no privileges (mostly spambots/spybots/offline downloaders that ignore robots.txt)
RewriteCond %{REMOTE_ADDR} "^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$" [OR] # Cyveillance spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR] # NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR] # NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^64\.140\.49\.6([6-9])$ [OR] # Turnitin spybot
RewriteCond %{REMOTE_ADDR} ^216\.169\.(9[6-9]|1[01][0-9]|12[0-7])\. [OR] # rude bot
RewriteCond %{HTTP_REFERER} citylinkz\.com [NC,OR] # log spambot
RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR] # spambot
RewriteCond %{HTTP_REFERER} netfactual\.com [NC,OR] # rude bot
RewriteCond %{HTTP_REFERER} traffixer\.com [NC,OR] # log spambot
RewriteCond %{HTTP_REFERER} web\.ask\.com [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ZyBorg [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^[A-Z]+$ [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} anarchie [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} AOLserver-Tcl/3\.5\.6 [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} Atomz [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} cherry.?picker [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "compatible ; MSIE 6.0" [NC,OR] # spambot (note extra space before semicolon)
RewriteCond %{HTTP_USER_AGENT} crescent [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "^DA \d\.\d+" [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "DTS Agent" [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "^Download" [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} EasyDL/\d\.\d+ [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} EmeraldShield [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} e?mail.?(collector|magnet|reaper|siphon|sweeper|harvest|collect|wolf) [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} express [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} extractor [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "Fetch API Request" [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} flashget [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} FlickBot [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} FrontPage [NC,OR] # stupid user trying to edit my site
RewriteCond %{HTTP_USER_AGENT} getright [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} go.?zilla [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "efp@gmx\.net" [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} Gigabot [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} Girafabot [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} grabber [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} grub [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "Hosting Client" [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} HostItCheap [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} Hotbar [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} httrack [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} imagefetch [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "Indy Library" [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "^Internet Explore" [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ^IE\ \d\.\d\ Compatible.*Browser$ [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} Larbin [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "libwww-perl/5\.68" [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "LINKS ARoMATIZED" [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} lwp-trivial [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} MediBot [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "Microsoft URL Control" [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "^Microsoft-WebDAV-MiniRedir/5\.1\.2600$" [NC,OR] # unknown
RewriteCond %{HTTP_USER_AGENT} "mister pix" [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4\.0$" [NC,OR] # dumb bot
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/\?\?$" [NC,OR] # formmail attacker
RewriteCond %{HTTP_USER_AGENT} MSIECrawler [NC,OR] # IE’s "make available offline" mode
RewriteCond %{HTTP_USER_AGENT} ^NG [NC,OR] # unknown bot
RewriteCond %{HTTP_USER_AGENT} "^obot$" [NC,OR] #
RewriteCond %{HTTP_USER_AGENT} offline [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} NaverRobot [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} net.?(ants|mechanic|spider|vampire|zip) [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} Netcraft [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} nicerspro [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ninja [NC,OR] # Download Ninja OD
RewriteCond %{HTTP_USER_AGENT} NPBot [NC,OR] # NameProtect spybot
RewriteCond %{HTTP_USER_AGENT} PersonaPilot [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} psbot [NC,OR] # image thief bot
RewriteCond %{HTTP_USER_AGENT} Scooter [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} semanticdiscovery [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} snagger [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} Sqworm [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} SurveyBot [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} tele(port|soft) [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} Teoma [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} T-H-U-N-D-E-R-S-T-O-N-E [NC,OR] # rudebot
RewriteCond %{HTTP_USER_AGENT} "Torrent Crawler" [NC,OR] # Rude Torrent Crawler 
RewriteCond %{HTTP_USER_AGENT} TurnitinBot [NC,OR] # Turnitin spybot
RewriteCond %{HTTP_USER_AGENT} twiceler [NC,OR] # experimental bot
RewriteCond %{HTTP_USER_AGENT} VoilaBot [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} web.?(auto|bandit|collector|copier|devil|downloader|fetch|hook|mole|miner|mirror|reaper|sauger|sucker|site|snake|stripper|weasel|zip) [NC,OR] # ODs
RewriteCond %{HTTP_USER_AGENT} vayala [NC,OR] # dumb bot, doesn’t know how to follow links, generates lots of 404s
RewriteCond %{HTTP_USER_AGENT} zeus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4\.0 compatible ZyBorg/1\.0 \(wn\.zyborg@looksmart\.net; http://www\.WISEnutbot\.com\)$" [NC] # rude bot
RewriteRule .* - [F,L]

If you wanted to add Yahoo, for example, then to keep the list in alphabetical order just place it after the vayala entry, like so:
Code:
RewriteCond %{HTTP_USER_AGENT} vayala [NC,OR] # dumb bot, doesn't know how to follow links, generates lots of 404 error pages
RewriteCond %{HTTP_USER_AGENT} "Yahoo! Slurp" [NC,OR] # bandwidth hog bot

Do likewise if you want to add more that you know of.
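
The shape the finished chain needs can be sketched as follows, cut down to three entries. ExampleBotA and ExampleBotB are made-up placeholders, not real entries from the list. Conditions are normally ANDed together, so every one except the last carries the OR flag; the last must not, because a dangling OR lets the rule fire on every request.

Code:
RewriteEngine on
# Each of these is an alternative, so each ends in OR
RewriteCond %{HTTP_USER_AGENT} ExampleBotA [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Yahoo! Slurp" [NC,OR]
# Final condition: no OR flag
RewriteCond %{HTTP_USER_AGENT} ExampleBotB [NC]
# F sends 403 Forbidden, L stops further rule processing
RewriteRule .* - [F,L]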

Or, if you just want to add Yahoo to get you started and to test the waters, so to speak (maybe you feel Google is the only search engine worth having, and Yahoo looks like it will be swallowed up by the greedy Micro$oft entity anyway), then just add this to your .htaccess file. It can be the last entry in the file, after all others, if you wish:

Code:
RewriteEngine on 
RewriteBase /
# Single condition, so no OR flag: a trailing OR here would make the rule match every visitor
RewriteCond %{HTTP_USER_AGENT} "Yahoo! Slurp" [NC] # bandwidth hog bot
RewriteRule .* - [F,L]

I hope this helps to keep your log files and bandwidth under your control; it has for me. I am also in the process of creating a new tutorial for the Tutorials Forum. As I pointed out, there are many bots that just choose to ignore your wishes and steal your bandwidth for their own greedy self-interest, which is a pet hate of mine, especially if one is paying for that bandwidth.

Regards,
Rocky
 

Smith6612

Thanks for the .htaccess tutorial. If I ever need to use that, I'll be sure to check back here. As for me running out of bandwidth, all I have to do is change my domain name to point to my home connection. With my nightly downloads of my database and weekly backups of the home directory, if something happens, I just open those backups, throw them on the server and change the DNS on the domain, and I'm all set.
 

Brandon

Former Senior Account Rep
Community Support
Messages
19,181
Reaction score
28
Points
48
Yahoo is always on my forums, usually 2-5 spiders at a minimum. I am not sure why, but they must have a lot of them.
 

Smith6612

Yahoo probably does, as SMF 2.0 saw over a thousand cases of Yahoo being in my forum. That was quadruple the number for Google. So yeah, there are two bots sitting in the forum right now, but they never leave :p
 

tittat

Active Member
Messages
2,478
Reaction score
1
Points
38
http://forums.x10hosting.com/online.php?sort=username&order=asc&pp=20&who=spiders
Check this page.
You can view the Yahoo spiders on the x10 forums. Hundreds of Yahoo spiders are there too. This is a normal trend.

I recommend that you do not prevent Yahoo from crawling your pages using .htaccess or robots.txt, because search engines are the doorways for your main traffic. Better not to interfere with those spiders.

But you may prevent nasty or naughty spiders from spidering your site.
 

Smith6612

I ate all of the x10Pizza
Community Support
Messages
6,517
Reaction score
48
Points
48
I wasn't planning on blocking the bots. My Home server gets hit with stuff like this every week and it hardly uses any bandwidth.
 

tittat

Smith6612 said: "I wasn't planning on blocking the bots. My Home server gets hit with stuff like this every week and it hardly uses any bandwidth."

That's the right decision. Keep up the good work, and good luck. :drool:
 