184.108.40.206 - - [18/Mar/2013:08:07:28 +0000] "GET /wp-content/themes/shell-master/media-queries.css?ver=0.1.1 HTTP/1.1" 200 1541 "http://infoheap.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "V:infoheap.com t:20130318080728 D:875 -"
220.127.116.11 - - [18/Mar/2013:18:45:08 +0000] "GET /wp-content/plugins/contact-form-7/includes/js/scripts.js?ver=3.3.3 HTTP/1.1" 301 286 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "V:infoheap.com t:20130318184508 D:323 -"
- There may be things hidden in the HTML, so plain-text analysis alone may not be a good idea. A responsible search engine should crawl and interpret everything, including CSS and JavaScript.
- I have even seen Flash content shown in search results. It's a good thing for Flash content discovery.
So what should be in robots.txt? I think a good robots.txt (at least as a starting point) for a WordPress site is:
User-agent: *
Disallow: /wp-admin/
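To sanity-check these rules before deploying them, here is a minimal sketch using Python's standard `urllib.robotparser`. It feeds the two-line robots.txt above to the parser and asks whether a crawler identifying as Googlebot may fetch a few paths. The helper name `build_parser` and the choice of sample paths (taken from the access log lines above) are just for illustration, and the standard-library parser only approximates how a real search engine interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The starting-point robots.txt suggested above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Hypothetical helper: parse robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Sample paths: the admin area plus the CSS and JS files seen in the logs.
sample_paths = [
    "/wp-admin/",
    "/wp-content/themes/shell-master/media-queries.css?ver=0.1.1",
    "/wp-content/plugins/contact-form-7/includes/js/scripts.js?ver=3.3.3",
]

for path in sample_paths:
    verdict = "allowed" if parser.can_fetch("Googlebot", path) else "blocked"
    print(f"{verdict}: {path}")
```

With these rules, only the admin area should come back as blocked, while the theme CSS and plugin JavaScript stay crawlable.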
Also see: Online robots.txt sandbox