Online robots.txt check
If you are writing a robots.txt file and want to check how it behaves for a specific crawler/bot/user-agent, you can use this tool to test it.
How to use wildcard in robots.txt
By default, a Disallow entry in robots.txt is matched by URL prefix. To block every URL beginning with /foo/, the following robots.txt can be used: User-agent: read more
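As a quick illustration (a minimal sketch, with /foo/ as a placeholder path), a prefix rule and a wildcard rule look like this; note that the * and $ wildcards are supported by Googlebot and Bingbot but not by every crawler:

User-agent: *
Disallow: /foo/
# Wildcard form: blocks any URL whose path ends in .pdf
Disallow: /*.pdf$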
robots.txt disallow all example
Sometimes we need to block all robots from crawling a website. This can be needed if you have a staging or sandbox website for read more
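For reference, the usual block-everything robots.txt is just two lines:

User-agent: *
Disallow: /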
How to resubmit robots.txt to Google
An error in robots.txt can cause Google to stop crawling a section of your site. In such a case it is a good idea to update and resubmit read more
Should I block Googlebot from crawling JavaScript and CSS?
I noticed that Googlebot is crawling JavaScript and CSS regularly from my WordPress blog site. Here are some entries from my Apache log: 66.249.75.66 read more
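For context, blocking those requests would take rules like the sketch below (the /wp-includes/ and /wp-content/themes/ paths are typical WordPress examples, not necessarily the ones discussed in the post); keep in mind that Google generally advises against blocking CSS and JavaScript because it renders pages when indexing:

User-agent: Googlebot
Disallow: /wp-includes/
Disallow: /wp-content/themes/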
Meta robots noindex, follow for WordPress tags and category pages
If you want search engines to either not index a page or not follow the links on it, you can use the robots meta tag. The default read more
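For example, the noindex, follow variant goes in the page's head section as:

<meta name="robots" content="noindex, follow">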
How to prevent Google from crawling your test blog but still show AdSense ads
I guess most of us maintain a test or sandbox blog to try out new plugins and run experiments that we can't afford on the main read more
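One common way to do this (a sketch, not necessarily the exact setup from the post) is to disallow all crawlers while leaving the AdSense crawler, Mediapartners-Google, allowed:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /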
Using the robotexclusionrulesparser Python module to parse robots.txt and crawl a site
Here is a small code snippet to demonstrate how to parse robots.txt and crawl a web page in Python using the robotexclusionrulesparser module: url = "http://www.foo.com/bar" read more
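A minimal sketch of that idea, assuming a Python 3 environment and the module's RobotExclusionRulesParser class with its fetch() and is_allowed() methods (the user-agent string and URLs below are placeholders):

import urllib.request
import robotexclusionrulesparser

url = "http://www.foo.com/bar"
user_agent = "MyCrawler"  # placeholder user-agent string

# Fetch and parse the site's robots.txt
rerp = robotexclusionrulesparser.RobotExclusionRulesParser()
rerp.fetch("http://www.foo.com/robots.txt")

# Crawl the page only if robots.txt allows it for this user-agent
if rerp.is_allowed(user_agent, url):
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    html = urllib.request.urlopen(request).read()
    print(html[:200])
else:
    print("Crawling disallowed by robots.txt")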