Using the robotexclusionrulesparser Python module to parse robots.txt and crawl a site

Here is a small code snippet that demonstrates how to parse robots.txt before crawling a web page in Python, using the robotexclusionrulesparser module:

url = "http://www.foo.com/bar"
agent = "The name of your agent. You should mention you site name here with contact details"
o = urlparse.urlparse(url)
netloc = o[1]
robots_url = "http://" + netloc + "/robots.txt"
rp = robotexclusionrulesparser.RobotExclusionRulesParser()
rp.user_agent = agent
rp.fetch(robots_url)

You need to do this only once per site; the rp object can then be used to decide whether a given URL may be crawled. Use the following call to check whether a URL is allowed:

rp.is_allowed(agent, url)
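
For example, a crawl loop can filter every candidate URL through this check before fetching it. The sketch below is a minimal illustration only: the candidate_urls list and the use of urllib2 for the download are assumptions, not part of the module's API.

import urllib2

# Hypothetical list of URLs discovered on the site
candidate_urls = [
    "http://www.foo.com/bar",
    "http://www.foo.com/private/baz",
]

for u in candidate_urls:
    if rp.is_allowed(agent, u):
        # Send the same user agent string with the page request
        request = urllib2.Request(u, headers={"User-Agent": agent})
        html = urllib2.urlopen(request).read()
        # ... process the page here ...
    else:
        print "Skipping disallowed URL:", u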

In addition, you should not crawl a site too frequently. Leaving a sufficiently long interval between two successive requests is standard practice.
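
As a minimal sketch of such throttling (reusing the rp, agent, and urllib2 setup from the snippets above), the helper below waits between requests. The polite_fetch name and the 5-second fallback are assumptions; robotexclusionrulesparser can also report a site's Crawl-delay directive via get_crawl_delay(), which is used here when one is defined.

import time
import urllib2

# Fallback pause between requests, in seconds (an assumed value; tune it per site)
DEFAULT_DELAY = 5

# Honour the site's Crawl-delay directive for this agent if robots.txt defines one
delay = rp.get_crawl_delay(agent) or DEFAULT_DELAY

def polite_fetch(u):
    # Hypothetical helper: fetch an allowed URL, then wait before the next request
    request = urllib2.Request(u, headers={"User-Agent": agent})
    html = urllib2.urlopen(request).read()
    time.sleep(delay)
    return html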
