Deal With Spybots, Spambots and Scrapers

rockee

New Member
Messages
120
Reaction score
0
Points
0
THE HTACCESS APPROACH
I was going to write a comprehensive tutorial on banning scrapers, spy bots and other misbehaving robots that either ignore the robots.txt file altogether, or read it and then ignore its instructions anyway and continue to scrape and spider your web site.

Clearly there are many such tutorials on the Internet, and a Google search using all of the keywords below in a single query found more info than I could post here (the forum has a size limitation on posts):

bad bots spam bots online downloaders

There are many pages of results for this search, so don't read only the first page, as you may find some more goodies like a useful PHP-Nuke module, spider traps and country-specific bad bots.

The pages that seem most useful and easiest to follow, and whose format is very much the one I would have used here, are these:

How to block spambots, ban spybots, and tell unwanted robots to go to hell.

List of Bad Bots

Blocking Bad Bots and Scrapers with .htaccess

htaccess guide - Blocking offline browsers and 'bad bots'


The sites to check out are many and varied, but they all have very useful information if you want to squish some of these server-hogging pests.


Because this is not a new subject, you may find some of the information and sites a bit dated, but the principle is sound, especially if you can find a more recent list of bad bots or you are a prolific reader and analyzer of your site's log files.
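As a rough illustration of the .htaccess approach, here is a minimal Python sketch that turns a list of bad-bot names into mod_rewrite rules that answer matching user agents with 403 Forbidden. The function name build_htaccess_block and the two bot names are hypothetical, made up for this example; real entries would come from one of the bad-bot lists mentioned above, and the output assumes mod_rewrite is enabled on your server.

```python
import re


def build_htaccess_block(bad_agents):
    """Return mod_rewrite lines that send 403 Forbidden to matching user agents."""
    agents = [a.strip() for a in bad_agents if a.strip()]
    if not agents:
        return ""
    lines = ["RewriteEngine On"]
    for i, agent in enumerate(agents):
        # [NC] = case-insensitive match; [OR] chains every condition except the last.
        flags = "NC" if i == len(agents) - 1 else "NC,OR"
        # re.escape keeps characters in the bot name from being read as regex syntax.
        lines.append(f"RewriteCond %{{HTTP_USER_AGENT}} {re.escape(agent)} [{flags}]")
    # "-" leaves the URL untouched, [F] returns 403 Forbidden, [L] stops rule processing.
    lines.append("RewriteRule .* - [F,L]")
    return "\n".join(lines) + "\n"


# Hypothetical names for illustration only; substitute entries from a current list.
EXAMPLE_AGENTS = ["ExampleScraper", "HypotheticalHarvester"]

if __name__ == "__main__":
    print(build_htaccess_block(EXAMPLE_AGENTS), end="")
```

Pasting the printed lines into a .htaccess file is the idea, but test on a non-critical folder first, since a bad pattern can lock out real visitors.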

One thing to remember also is that some of these bad bots actually hijack web sites and servers (zombies) so they can roam the Internet at will in disguise. Log file analysis may allow you to spot these, and perhaps to find a common reference point, such as a shared user-agent string or IP range, that allows an effective .htaccess entry.
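A minimal sketch of that kind of log analysis, assuming your server writes the common "combined" access log format (where the user agent is the last quoted field on each line). The function name count_agents and the log path in the usage note are assumptions for this example, not anything specific to your host.

```python
import re
from collections import Counter

# Combined log format: client IP comes first, user agent is the last quoted field.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"\s*$')


def count_agents(lines):
    """Count requests per (client IP, user agent) pair; unparseable lines are skipped."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            hits[(match.group("ip"), match.group("agent"))] += 1
    return hits
```

For example, with a hypothetical log path, `count_agents(open("access.log"))` followed by `.most_common(20)` on the result lists the twenty busiest IP and user-agent pairs, which is often where the scrapers show up.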


THE ROBOTS TEXT FILE APPROACH
An alternative to the .htaccess file is the robots.txt file but, as I have outlined above, it is only effective if these bad bots read the dang thing and follow your instructions. Most don't.
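To see what a well-behaved robot would make of your robots.txt, Python's standard urllib.robotparser module can evaluate the rules. This is only a sketch: the robots.txt content, the bot names and the example.com URLs are all hypothetical, and the helper name allowed is made up here. It shows what a compliant crawler should do, which is exactly the check a bad bot skips.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one named bot entirely, keep everyone out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, agent, url):
    """Return True if a compliant robot with this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With the rules above, `allowed(ROBOTS_TXT, "ExampleScraper", "http://example.com/index.html")` is False, while a robot not named in the file may fetch that page but not anything under /private/.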


A good place to start, and the authority on all matters relating to the robots.txt file, is located here:

The Web Robots Pages at robotstxt.org

They include pages about the items below, plus much more, and they use Previous/Next navigation for easier reading and understanding:

  • database of robots
  • robots.txt file checker
  • robot related meta tag information
  • IP look up
  • how to get the best listing in search engines
The above site is worth a visit so you can get a handle on this robots.txt file and use it to your best advantage.

Here is a useful link to Wikipedia relating to botnets.


I hope this article will be of use. Please post back if you can add to it with your own experiences and current lists of known mischievous robots.

Regards,
Rocky
 

tittat

Active Member
Messages
2,478
Reaction score
1
Points
38
One question... not related to this topic.

I have a .htaccess file with rewrites and a lot of other stuff.
My question is: if my .htaccess file gets too big, will that affect my site's response time?
 

rockee

You would not notice any overhead from a large .htaccess file doing mod_rewrites or any of its other tasks. It is only a very tiny, folder-by-folder extension of the server's httpd.conf file anyway. Imagine the size of a hosting company like X10 Hosting and the huge server configuration files it uses, yet you would not notice much overhead at the browser level at all from those conf files being parsed.

My .htaccess file is huge by normal standards and is about 80% mod_rewrites, and there is no noticeable overhead. In any case, how would you measure that latency, if there is any at all?
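One rough way to measure it, if you are curious: time the same request many times with the rules in place, then again with the .htaccess file trimmed, and compare the medians. Apache checks .htaccess files on each request rather than once at startup, so any cost would show up per request. The sketch below is only an illustration; the names fetch and median_latency are made up, the URL in the usage note is a placeholder, and network noise will usually swamp any difference.

```python
import statistics
import time
import urllib.request


def fetch(url):
    """Download one page and discard the body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()


def median_latency(request_once, runs=20):
    """Call request_once() `runs` times and return the median duration in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        request_once()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive than the mean to an occasional slow request.
    return statistics.median(samples)
```

Usage would look like `median_latency(lambda: fetch("http://example.com/"))`, run once with each version of the .htaccess file against your own site's address.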

The .htaccess file, even with many entries and jobs to do, is usually much smaller than 10k, and most are smaller than 1k. Compared with a 30k web page or a 60k graphic image being served, it would use only a flea bite of the server's resources.

Regards,
Rocky
 

tittat

"You would not notice any overhead from a large .htaccess"

Thanks rockee, this is what I wished to hear...


Does anyone else have a different opinion?
 

rockee

If you want a definitive answer or an informed opinion, then you should post your question in the forum that the tech support staff frequent most, as they are the only people at X10 Hosting who can give you the correct answer in relation to their servers.

The parsing of the .htaccess files that clients used on my servers, before I retired, did not noticeably affect those servers. What did affect them was the massive amount of needless traffic from the bad bots and scrapers, which the .htaccess files and the measures in place at the servers effectively reduced.

Regards,
Rocky
 

tittat

Whenever I read rockee's comments I am forced to give him reputation points... and I did.
I will say
rockee will become famous soon.

Regards,
Subeesh
 

rockee

Thank you kindly Subeesh. I can appreciate your hunger for knowledge as I too have been there and done that, but for the life of me, I still can't satisfy my hunger. ;)

Kindest regards and best wishes always,
Rocky
 