
Using robotexclusionrulesparser python module to parse robots.txt and crawl a site

By admin | Last updated on Mar 15, 2016

Here is a small code snippet that demonstrates how to parse robots.txt and crawl a web page in Python using the robotexclusionrulesparser module:

import urlparse
import robotexclusionrulesparser

url = "http://www.foo.com/bar"
# Identify your crawler; mention your site name and contact details here
agent = "The name of your agent. You should mention your site name here with contact details"
# Extract the host from the URL and build the robots.txt URL for that host
o = urlparse.urlparse(url)
netloc = o[1]
robots_url = "http://" + netloc + "/robots.txt"
# Fetch and parse robots.txt once per site
rp = robotexclusionrulesparser.RobotExclusionRulesParser()
rp.user_agent = agent
rp.fetch(robots_url)

You should do this only once per site in your program; rp can then be used to decide whether a given URL may be crawled. Use the following call to check whether a URL is allowed:

rp.is_allowed(agent, url)  # returns True if the URL may be crawled, False otherwise

In addition, you should not crawl a site too frequently. Leaving a sufficiently large time interval between two successive requests to the same site is standard practice.
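Putting it together, here is a minimal sketch of a polite crawl loop. It assumes Python 2, a hypothetical agent string and list of URLs on the same host, and a fixed 5-second pause between requests; adjust these to your own crawler and the site's policies.

import time
import urllib2
import urlparse
import robotexclusionrulesparser

# Hypothetical agent string; use your own site name and contact details
agent = "MyCrawler (http://www.example.com/contact)"

# Hypothetical list of URLs, all on the same host
urls = [
    "http://www.foo.com/bar",
    "http://www.foo.com/baz",
]

# Fetch and parse robots.txt once for the host
netloc = urlparse.urlparse(urls[0])[1]
rp = robotexclusionrulesparser.RobotExclusionRulesParser()
rp.user_agent = agent
rp.fetch("http://" + netloc + "/robots.txt")

for url in urls:
    if not rp.is_allowed(agent, url):
        print "skipping (disallowed):", url
        continue
    response = urllib2.urlopen(url)
    html = response.read()
    print "fetched", url, "-", len(html), "bytes"
    time.sleep(5)  # fixed pause between requests; pick a delay appropriate for the site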
