robots.txt Tutorials and Examples

Online robots.txt check

If you are writing a robots.txt file and want to test it against a specific crawler/bot/user-agent, you can use this tool.

How to use wildcard in robots.txt

By default, a Disallow entry in robots.txt matches by URL prefix. To block every URL beginning with /foo/, the following robots.txt can be used: User-agent: read more
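
A minimal sketch of such a prefix rule, using the /foo/ path from the example above:

    User-agent: *
    Disallow: /foo/

With this in place, compliant crawlers skip /foo/bar.html, /foo/baz/, and any other URL under that prefix.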

robots.txt disallow all example

Sometimes we need to block all robots from crawling a website. This can be necessary if you have a staging or sandbox website for read more
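
As a sketch, the standard way to disallow every compliant crawler from the whole site is:

    User-agent: *
    Disallow: /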

How to resubmit robots.txt to Google

An error in robots.txt can cause Google to stop crawling a section of your site. In such a case it is a good idea to update and resubmit read more

Should I block Googlebot from crawling JavaScript and CSS?

I noticed that Googlebot is regularly crawling JavaScript and CSS from my WordPress blog. Here are some entries from my Apache log: 66.249.75.66 read more
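
For illustration only (whether you should do this is exactly the question the post discusses), a robots.txt sketch that blocks Googlebot from .js and .css files using the * and $ wildcards Googlebot understands:

    User-agent: Googlebot
    Disallow: /*.js$
    Disallow: /*.css$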

Meta robots noindex, follow for WordPress tag and category pages

If you want search engines to either not index a page or not follow the links on it, you can use the robots meta tag. The default read more
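
For example, a robots meta tag of this form in the page's head asks search engines not to index the page but still follow its links, which is the usual setting for tag and category archives:

    <meta name="robots" content="noindex, follow">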

How to prevent Google from crawling your test blog but still show AdSense ads

I guess most of us maintain a test or sandbox blog to try out new plugins and do experiments which we can’t afford on main read more
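
One common sketch of such a robots.txt, assuming the AdSense crawler (Mediapartners-Google) should remain allowed while every other crawler is blocked:

    User-agent: Mediapartners-Google
    Disallow:

    User-agent: *
    Disallow: /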

Using the robotexclusionrulesparser Python module to parse robots.txt and crawl a site

Here is a small code snippet to demonstrate how to parse robots.txt and crawl a web page in Python using the robotexclusionrulesparser module: url = "http://www.foo.com/bar" read more
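
A minimal sketch of that approach, assuming the robotexclusionrulesparser package is installed (pip install robotexclusionrulesparser); the fetch and is_allowed calls follow that module's API as I understand it, and the URL and user agent below are placeholders:

    import urllib.request
    from robotexclusionrulesparser import RobotExclusionRulesParser

    url = "http://www.foo.com/bar"
    user_agent = "MyCrawler"  # placeholder user agent string

    # Fetch and parse the site's robots.txt
    rp = RobotExclusionRulesParser()
    rp.fetch("http://www.foo.com/robots.txt")

    # Crawl the page only if robots.txt allows it for this user agent
    if rp.is_allowed(user_agent, url):
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        html = urllib.request.urlopen(request).read()
        print(html[:200])
    else:
        print("Crawling disallowed by robots.txt:", url)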