
Web Crawlers' Traffic Signal - Robots.txt

[Web Crawlers' Traffic Signal - Robots.txt]
Web crawlers are software programs that download and process information on the Internet by collecting links, visiting them, and downloading the HTML source. Site owners can block all or specific web crawlers from crawling parts of their websites using rules published in a robots.txt file.

Polite web crawlers always respect the boundaries set in the robots.txt file. Technically, a web crawler can ignore the instructions in robots.txt; however, it is always advisable to follow them. Ignoring them may get your IP address blocked, or your crawler may fall victim to crawler traps, and so on.

[Requirement]
Your web crawler needs a component specifically designed for parsing the robots.txt file. Fortunately for Python users, there is already a library that parses the robots.txt for a given site and tells your web crawler whether or not it may crawl a particular URL.
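
As a rough sketch of what such a component could look like, assuming Python 3 and its standard-library urllib.robotparser module: the RobotsGate class, its load_lines hook, the MyCrawler user agent, and the example.com URLs are all hypothetical names chosen for illustration. It keeps one parsed robots.txt per site so the file is not read again for every URL.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsGate:
    """Hypothetical crawler component: one cached robots.txt parser per site."""

    def __init__(self, user_agent, load_lines=None):
        self.user_agent = user_agent
        # Optional hook: load_lines(robots_url) returns the robots.txt lines.
        # When it is None, the parser downloads the file itself.
        self._load_lines = load_lines
        self._parsers = {}

    def _parser_for(self, url):
        parts = urlsplit(url)
        site = f"{parts.scheme}://{parts.netloc}"
        if site not in self._parsers:
            robots_url = site + "/robots.txt"
            parser = RobotFileParser(robots_url)
            if self._load_lines is None:
                parser.read()  # fetches robots.txt over the network
            else:
                parser.parse(self._load_lines(robots_url))
            self._parsers[site] = parser
        return self._parsers[site]

    def allowed(self, url):
        """Return True if this crawler's user agent may fetch the URL."""
        return self._parser_for(url).can_fetch(self.user_agent, url)


# Offline usage with made-up rules instead of a real download.
def fake_loader(robots_url):
    return ["User-agent: *", "Disallow: /admin/"]


gate = RobotsGate("MyCrawler", load_lines=fake_loader)
ok = gate.allowed("http://example.com/articles/1")
blocked = gate.allowed("http://example.com/admin/login")
```

The crawler would call allowed() before every request and skip any URL for which it returns False.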

[Python Library]
Python provides a class called RobotFileParser (in the robotparser module in Python 2, and in urllib.robotparser in Python 3), which checks whether or not a particular user agent can fetch a URL on the web site that published the robots.txt file.

The following code shows the basic usage of the RobotFileParser class.
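
A minimal sketch of that usage, assuming Python 3, where the class is imported from urllib.robotparser. The robots.txt rules, the user agent names MyCrawler and BadBot, and the example.com URLs are made up for illustration; the rules are passed to parse() directly so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL.
allowed_public = parser.can_fetch("MyCrawler", "http://example.com/index.html")
allowed_private = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")
allowed_badbot = parser.can_fetch("BadBot", "http://example.com/index.html")

print(allowed_public, allowed_private, allowed_badbot)
```

Against a live site, you would instead call parser.set_url() with the site's robots.txt address and then parser.read() to download and parse the file before calling can_fetch().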



[Sample Code]
You can view the sample code and class information at: Python Robot File Parser Library
Copyright © 2012 - Zimbio, Inc. Some rights reserved.