rockee
THE HTACCESS APPROACH
I was going to write a comprehensive tutorial on banning scrapers, spy bots and other misbehaving robots that either never read the robots.txt file at all, or read it and then ignore its instructions anyway and carry on scraping and spidering your web site.
Clearly there are already many such tutorials on the Internet. A Google search using all of the keywords below in a single query found more info than I could post here (the forum limits the size of posts):
bad bots spam bots online downloaders
The search returns many pages of results, so don't stop at the first page; further in you may find more goodies like a useful PHP-Nuke module, spider traps and country-specific bad bot lists.
The most useful and seemingly easiest to follow, and very much in the format I would have used here, are these pages:
How to block spambots, ban spybots, and tell unwanted robots to go to hell.
List of Bad Bots
Blocking Bad Bots and Scrapers with .htaccess
htaccess guide - Blocking offline browsers and 'bad bots'
There are many and varied sites to check out, and they all have useful information if you want to squish some of these server-hogging pests.
Because this is not a new subject you may find some of the information and sites a bit dated, but the principle still holds, especially if you can find a more recent list of bad bots or you regularly read and analyze your site's log files.
One more thing to remember is that some of these bad bots hijack web sites and servers (zombies) so they can roam the Internet at will while masquerading as those machines. Log file analysis may let you spot them, and perhaps a common reference point (a shared user-agent string, for example) that allows an effective .htaccess entry.
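As a rough illustration of the log analysis idea, here is a minimal Python sketch. It assumes your server writes the common "combined" log format, where the user-agent is the last quoted field on each line. The bot name ExampleScraper, the IP addresses and the helper names count_user_agents and htaccess_block are all made up for the example; swap in whatever your own logs show. The generated rules assume Apache with mod_rewrite enabled.

```python
import re
from collections import Counter

# In the combined log format each line ends with: "referer" "user-agent"
_UA_AT_END = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(log_lines):
    """Count requests per user-agent string (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        match = _UA_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def htaccess_block(agent_fragments):
    """Build mod_rewrite rules that answer 403 Forbidden to matching agents.

    Each fragment is matched case-insensitively ([NC]) anywhere in the
    user-agent header; conditions are chained with [OR].
    """
    if not agent_fragments:
        return ""
    lines = ["RewriteEngine On"]
    last = len(agent_fragments) - 1
    for i, fragment in enumerate(agent_fragments):
        flags = "[NC]" if i == last else "[NC,OR]"
        pattern = re.escape(fragment)  # treat the fragment as literal text
        lines.append(f"RewriteCond %{{HTTP_USER_AGENT}} {pattern} {flags}")
    lines.append("RewriteRule .* - [F,L]")
    return "\n".join(lines)


# Invented sample lines; the addresses come from documentation-only ranges.
sample_log = [
    '192.0.2.10 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleScraper/1.0"',
    '192.0.2.10 - - [01/Jan/2024:10:00:01 +0000] "GET /a HTTP/1.1" 200 734 "-" "ExampleScraper/1.0"',
    '198.51.100.7 - - [01/Jan/2024:10:00:05 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

for agent, hits in count_user_agents(sample_log).most_common():
    print(hits, agent)

print(htaccess_block(["ExampleScraper"]))
```

Check any generated rules against your own logs before pasting them into .htaccess, since a fragment that is too short can lock out legitimate visitors as well.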
THE ROBOTS TEXT FILE APPROACH
An alternative to the .htaccess file is the robots.txt file but, as I have outlined above, it is only effective if these bad bots read the dang thing and follow your instructions. Most don't.
A good place to start, and the authority on all matters relating to the robots.txt file, is here:
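To show what "following your instructions" means in practice, here is a small Python sketch using the standard library's urllib.robotparser, which is roughly what a well-behaved crawler does before fetching a page. The robots.txt content, the bot names ExampleScraper and FriendlyBot, and the helper is_allowed are invented for the example. Nothing forces a bot to run this check, which is exactly the weakness described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named bot, keep everyone else
# away from /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent, url in [
    ("ExampleScraper", "http://example.com/page.html"),
    ("FriendlyBot", "http://example.com/page.html"),
    ("FriendlyBot", "http://example.com/private/x.html"),
]:
    print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A polite robot makes this check voluntarily; a bad bot just skips it, which is why the .htaccess approach is needed for those.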
The Web Robots Pages at robotstxt.org
They include pages on the items below plus much more, and they use Previous/Next navigation for easier reading and understanding:
- database of robots
- robots.txt file checker
- robot related meta tag information
- IP lookup
- how to get the best listing in search engines
Wikipedia also has a useful article on botnets.
I hope this article will be of use. Please post back if you can add to it with your own current lists of known mischievous robots and your experiences with them.
Regards,
Rocky