Here is a small code snippet that demonstrates how to parse robots.txt and crawl a web page in Python using the robotexclusionrulesparser module:
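    import robotexclusionrulesparser
    import urlparse  # Python 2; on Python 3 use: from urllib import parse as urlparse

    url = "http://www.foo.com/bar"
    agent = "The name of your agent. You should mention your site name here with contact details"

    # Build the robots.txt URL from the host part of the page URL.
    o = urlparse.urlparse(url)
    netloc = o.netloc
    robots_url = "http://" + netloc + "/robots.txt"

    # Fetch and parse robots.txt for this host.
    rp = robotexclusionrulesparser.RobotExclusionRulesParser()
    rp.user_agent = agent
    rp.fetch(robots_url)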
You should do this only once in your program; rp can then be used to decide whether a URL may be crawled. To check a URL, call rp.is_allowed(agent, url), which returns True if the URL is allowed and False otherwise.
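The same allow/deny check can be sketched with Python's standard library. This is a minimal sketch, not the robotexclusionrulesparser API: it swaps in urllib.robotparser so that it runs without extra dependencies. The robots.txt content, the agent name "ExampleBot", and the helper names build_parser and is_allowed are hypothetical, chosen only for illustration, and the rules are parsed from a string so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this example).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text once and return a reusable parser."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


def is_allowed(rp, agent, url):
    """Return True if `agent` may crawl `url` according to the parsed rules."""
    return rp.can_fetch(agent, url)


rp = build_parser(SAMPLE_ROBOTS_TXT)
agent = "ExampleBot"  # hypothetical agent name

for url in ("http://www.foo.com/bar", "http://www.foo.com/private/data"):
    print(url, is_allowed(rp, agent, url))
```

In a real crawler the parser would be built once per site and reused for every URL on that site.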
In addition, you should not crawl a site too frequently. Leaving a sufficiently long interval between successive requests to the same site is standard practice.
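One way to enforce such an interval is a small per-host throttle. The sketch below is only an illustration under stated assumptions: the class name PoliteThrottle, its parameters, and the choice of interval are all hypothetical, and the right delay depends on the site being crawled. The clock and sleep functions are injectable so the logic can be exercised without actually waiting.

```python
import time
from urllib.parse import urlparse


class PoliteThrottle:
    """Enforce a minimum delay between requests to the same host (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests to one host
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep until the host of `url` may be requested again; return seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            # Time still owed before the interval since the last request has passed.
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually issued (after any sleep).
        self._last_request[host] = now + delay
        return delay
```

A crawler could create one instance, for example PoliteThrottle(5.0), and call wait(url) immediately before each fetch. Requests to different hosts do not delay each other.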