Some important points about robots.txt
- A site should have exactly one robots.txt, placed in its root directory, e.g. http://yoursite.com/robots.txt
- There must be exactly one User-Agent field per record
- A robot should match its own name against the User-Agent value specified in robots.txt using a case-insensitive substring match.
- An empty Disallow value indicates that all URIs can be retrieved. At least one Disallow field must be present in robots.txt.
- "User-Agent: *" applies to any robot that does not have its own specific entry. Multiple "*" entries are not allowed.
- This tool uses the Python Robotexclusionrulesparser library.
- HTML 4.01 specification (Section B.4.1 about search robots)
- Wikipedia: Robots exclusion standard
- WordPress blog robots.txt: allow all except the admin area
    User-agent: *
    Disallow: /wp-admin/
- allow all robots.txt
    User-agent: *
    Disallow:
- allow none robots.txt
    User-agent: *
    Disallow: /
- allow all except one robots.txt
    User-agent: *
    Disallow:

    User-agent: Somebadbot
    Disallow: /
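
The example files above can be checked programmatically. The following is a minimal sketch that uses `urllib.robotparser` from the Python standard library, which is a substitute chosen for illustration and not the Robotexclusionrulesparser library this tool uses. The helper name `allowed`, the bot name `Googlebot`, and the page URLs are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the text in memory
    instead of downloading http://yoursite.com/robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "allow all except one" example: two records separated by a blank line.
ALLOW_ALL_EXCEPT_ONE = """\
User-agent: *
Disallow:

User-agent: Somebadbot
Disallow: /
"""

# WordPress example: everything is allowed except the admin area.
WORDPRESS = """\
User-agent: *
Disallow: /wp-admin/
"""

if __name__ == "__main__":
    # A robot without its own entry falls back to the "*" record.
    print(allowed(ALLOW_ALL_EXCEPT_ONE, "Googlebot", "http://yoursite.com/page.html"))
    # The name is matched case-insensitively, ignoring the version suffix.
    print(allowed(ALLOW_ALL_EXCEPT_ONE, "SomeBadBot/2.1", "http://yoursite.com/page.html"))
    # Only paths under /wp-admin/ are blocked.
    print(allowed(WORDPRESS, "Googlebot", "http://yoursite.com/wp-admin/options.php"))
```

Parsing the text directly keeps the check offline, which is convenient when testing a robots.txt before publishing it.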