Using the robotexclusionrulesparser Python module to parse robots.txt and crawl a site

By admin | Last updated on Mar 15, 2016

Here is a small code snippet demonstrating how to parse robots.txt and crawl a web page in Python using the robotexclusionrulesparser module.

# The urlparse module is Python 2; on Python 3 use urllib.parse instead
import urlparse
import robotexclusionrulesparser

url = "http://www.foo.com/bar"
agent = "The name of your agent. You should mention your site name here with contact details"

# Derive the robots.txt URL from the host part of the target URL
o = urlparse.urlparse(url)
netloc = o[1]
robots_url = "http://" + netloc + "/robots.txt"

# Fetch and parse robots.txt; user_agent is sent as the User-Agent header of the fetch
rp = robotexclusionrulesparser.RobotExclusionRulesParser()
rp.user_agent = agent
rp.fetch(robots_url)

You need to do this only once per site; rp can then be used to decide whether a URL may be crawled. Use the following call to check if a URL is allowed for your agent:

rp.is_allowed(agent, url)
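
For example, a crawl step could look like the sketch below (urllib2 is used here only for illustration; it is not part of the original snippet):

import urllib2

if rp.is_allowed(agent, url):
    # robots.txt permits this URL for our agent, so fetch it
    request = urllib2.Request(url, headers={"User-Agent": agent})
    html = urllib2.urlopen(request).read()
else:
    print "Disallowed by robots.txt:", url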

In addition, you should not crawl a site too frequently; leaving a sufficient time interval between two requests to the same site is standard practice.
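
The parser can also report a site's Crawl-delay directive via get_crawl_delay(). The sketch below shows one way to honour it; urls_to_crawl and crawl() are hypothetical placeholders, and the default delay is an assumption for sites that set no Crawl-delay:

import time

delay = rp.get_crawl_delay(agent)
if delay is None:
    delay = 5  # assumed conservative default (seconds) when the site sets no Crawl-delay

for page_url in urls_to_crawl:  # urls_to_crawl: hypothetical list of URLs on the same site
    if rp.is_allowed(agent, page_url):
        crawl(page_url)  # crawl() is a placeholder for your own fetch-and-process logic
    time.sleep(delay)  # pause between requests so the site is not hit too frequently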

Posted in Tutorials | Tagged Python, robots.txt, Tutorials, Web Development