Using Python to analyze bots from Apache logs

Apache logs contain useful information about the visitors and bots coming to your site. Here is what a typical Apache log entry looks like:

66.249.73.146 - - [09/Mar/2013:19:18:18 +0000] "GET / HTTP/1.1" 200 7266 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The last field is the User-agent field, which tells us about the visitor. In this case it is Googlebot. Some fields contain spaces, so parsing an Apache log requires some sort of CSV parser. Here is a quick Python snippet to find out which User-agents visit your blog site most often:

cat /etc/httpd/logs/access_log.2013-03-10 | python -c "import csv,sys;f=csv.reader(sys.stdin, delimiter=' '); print('\n'.join(r[9] for r in f if len(r) > 9))" | sort | uniq -c | sort -rn

The logfile name may be different depending upon your setup.
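
If you prefer a standalone script over a shell pipeline, the same idea can be sketched in plain Python. This is only a minimal sketch: the function name count_user_agents is my own, and it assumes the log uses the layout shown in the sample entry above, where the User-agent is the tenth space-separated field once quoted fields are kept together.

```python
import csv
from collections import Counter


def count_user_agents(lines):
    """Count User-agent strings in an iterable of Apache log lines.

    Assumes the layout of the sample entry above, where the
    User-agent is field index 9 after space-delimited CSV parsing.
    """
    counts = Counter()
    # A space-delimited csv reader keeps quoted fields (request line,
    # referer, User-agent) together even though they contain spaces.
    for row in csv.reader(lines, delimiter=' '):
        if len(row) > 9:  # skip short or malformed lines
            counts[row[9]] += 1
    return counts


# The sample entry from this post, used as stand-in input.
sample = [
    '66.249.73.146 - - [09/Mar/2013:19:18:18 +0000] "GET / HTTP/1.1" '
    '200 7266 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
]

# Print the most frequent User-agents first, like sort | uniq -c | sort -rn.
for agent, hits in count_user_agents(sample).most_common():
    print(hits, agent)
```

To run it against a real log, pass an open file object instead of the sample list, for example count_user_agents(open(path)) with path pointing at your own access log.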

Here is the outcome (I have picked only the bot entries) for the last few days of my Apache log file:

    999 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    368 Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
    291 Mozilla/5.0 (compatible; MJ12bot/v1.4.3; http://www.majestic12.co.uk/bot.php?+)
    158 Mediapartners-Google
     98 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
     92 msnbot/2.0b (+http://search.msn.com/msnbot.htm)
     81 Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=9370706305395655573)
     61 msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
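
I picked the bot entries above by hand, but a rough keyword filter can do a first pass. The sketch below is an assumption on my part, not a reliable detector: the hint list BOT_HINTS and the helper is_probable_bot are hypothetical names, the keywords are simply taken from the entries listed above, and the browser string is a made-up example of a non-bot visitor.

```python
# Substrings that commonly appear in crawler User-agents.
# Assumed and far from exhaustive; extend it for your own logs.
BOT_HINTS = ("bot", "spider", "crawl", "feedfetcher", "mediapartners")


def is_probable_bot(user_agent):
    """Return True if the User-agent contains one of the hint substrings."""
    ua = user_agent.lower()
    return any(hint in ua for hint in BOT_HINTS)


agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mediapartners-Google",
    # Illustrative browser User-agent, not taken from my logs.
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.152 Safari/537.22",
]

# Keep only the entries that look like bots.
for agent in agents:
    if is_probable_bot(agent):
        print(agent)
```

A substring match like this will miss bots that disguise themselves as browsers and may flag an ordinary visitor whose User-agent happens to contain one of the keywords, so treat the result as a starting point.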

It is interesting to see that Google's crawlers are a lot more active than the crawlers of other search engines. That also suggests that Google's results are likely to be more comprehensive and fresher than those of other engines.
