I’ve recently been playing around with a Perl script which sits in the /cgi-bin/ of a site.
The idea is that the script catches robots that are not obeying the robots.txt file found in the domain’s root. The robots.txt is a useful file (if obeyed) in that it tells search engines and content spiders what they are and aren’t allowed to view. If a particular folder or file is disallowed in the robots.txt, then the robot should ignore said file. The problem is that the internet is full of leechers and badly behaved robots just out to leech information, so we needed some way of stopping that practice.
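For anyone who hasn’t seen one, a minimal robots.txt looks something like this (the paths here are just illustrative examples, not from any particular site):

```
# Well-behaved crawlers read this file and skip the disallowed paths.
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
```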
After some digging around I found a Perl script ready to go. The great thing about it is that, when triggered, it adds the robot to a ban list in the site’s .htaccess file, banning that IP from the site completely.
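I won’t reproduce the whole script here, but the core of the idea can be sketched roughly like this. Note this is a minimal sketch, not the actual script: the .htaccess path is a made-up example, and I’m assuming Apache 2.2-style “Deny from” syntax for the ban list.

```perl
#!/usr/bin/perl
# Sketch of a bot trap: when a spider requests the trap URL, append
# a ban rule for its IP address to the site's .htaccess file.
use strict;
use warnings;

# Build the .htaccess line that bans a single address
# (assumes Apache 2.2-style access control).
sub deny_line {
    my ($ip) = @_;
    return "Deny from $ip\n";
}

# Append the ban to the site's .htaccess file.
sub ban_ip {
    my ($ip, $htaccess) = @_;
    open my $fh, '>>', $htaccess or die "cannot open $htaccess: $!";
    print {$fh} deny_line($ip);
    close $fh;
}

# In a CGI context the visitor's address arrives in REMOTE_ADDR.
my $ip = $ENV{REMOTE_ADDR} || 'unknown';

# ban_ip($ip, '/home/site/public_html/.htaccess');  # hypothetical path

# Answer the spider with something harmless.
print "Content-type: text/plain\r\n\r\n";
print "Goodbye.\n";
```

The real script also e-mails the admin when it fires; I’ve left that out of the sketch for brevity.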
The way I’ve implemented this is to have a hidden link of the form:
<a href="getout.php" onmouseover="window.status='Burglar Alarm'; return true;" onclick="return false;">
<img src="../images_folder/oddly_named_graphic.gif" alt="" border="0" width="1" height="1"></a>
This link is not evident to a human surfer, but a spider will find the file getout.php and try to read it. Little does the spider know that getout.php is redirected to my Perl script, and so it is snared!
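One way to set up that redirection is with a mod_rewrite rule in the site’s .htaccess. The script name below is a placeholder, not the name of the actual script:

```
# Hypothetical .htaccess fragment: silently hand requests for
# getout.php to the trap script living in /cgi-bin/.
RewriteEngine On
RewriteRule ^getout\.php$ /cgi-bin/trap.pl [L]
```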
Within a couple of hours of implementing this across our clients’ sites, the script had sent me an e-mail informing me that it had banned its first robot.
A great way to keep your site’s bandwidth down and your server running fast.